Runtime failure management of redundantly deployed hosts of a supervisory process control data acquisition facility

ABSTRACT

A redundant host pair runtime arrangement is disclosed for a process control network environment. The arrangement includes a primary network. A first partner of a fail-over host pair operates on a first machine communicatively connected to the primary network. The first partner hosts a set of executing application components in accordance with an active role assigned to the first partner. A second partner of the fail-over host pair operates on a second machine communicatively connected to the primary network. The second partner hosts a non-executing version of the set of executing application components in accordance with a standby runtime role. A monitoring process, operating separately upon the first machine, senses a failure of the first partner, and in response, initiates a notification to the second partner to take over the active role.

TECHNICAL FIELD

The present invention generally relates to the field of networked computerized process control systems. More particularly, the present invention relates to supervisory process control and manufacturing information systems. Such systems generally execute above a control layer in a process control system to provide guidance to lower level control elements such as, by way of example, programmable logic controllers.

BACKGROUND

Industry increasingly depends upon highly automated data acquisition and control systems to ensure that industrial processes are run efficiently, safely and reliably while lowering their overall production costs. Data acquisition begins when a number of sensors measure aspects of an industrial process and periodically report their measurements back to a data collection and control system. Such measurements come in a wide variety of forms. By way of example, the measurements produced by a sensor/recorder include: a temperature, a pressure, a pH, a mass/volume flow of material, a tallied inventory of packages waiting in a shipping line, or a photograph of a room in a factory. Often sophisticated process management and control software examines the incoming data, produces status reports, and, in many cases, responds by sending commands to actuators/controllers that adjust the operation of at least a portion of the industrial process. The data produced by the sensors also allow an operator to perform a number of supervisory tasks including: tailor the process (e.g., specify new set points) in response to varying external conditions (including costs of raw materials), detect an inefficient/non-optimal operating condition and/or impending equipment failure, and take remedial actions such as move equipment into and out of service as required.

Typical industrial processes are extremely complex and receive substantially greater volumes of information than any human could possibly digest in its raw form. By way of example, it is not unheard of to have thousands of sensors and control elements (e.g., valve actuators) monitoring/controlling aspects of a multi-stage process within an industrial plant. These sensors are of varied type and report on varied characteristics of the process. Their outputs are similarly varied in the meaning of their measurements, in the amount of data sent for each measurement, and in the frequency of their measurements. As regards the latter, for accuracy and to enable quick response, some of these sensors/control elements take one or more measurements every second. When multiplied by thousands of sensors/control elements, this results in so much data flowing into the process control system that sophisticated data management and process visualization techniques are required.

Highly advanced human-machine interface/process visualization systems exist today that are linked to data sources such as the above-described sensors and controllers. Such systems acquire and digest (e.g., filter) the process data described above. The digested process data in turn drives a graphical display rendered by a human-machine interface. An example of such a system is the well-known Wonderware IN-TOUCH® human-machine interface (HMI) software system for visualizing and controlling a wide variety of industrial processes. An IN-TOUCH HMI process visualization application includes a set of graphical views of a particular process. Each view, in turn, comprises one or more graphical elements. The graphical elements are “animated” in the sense that their display state changes over time in response to associated/linked data sources. For example, a view of a refining process potentially includes a tank graphical element. The tank graphical element has a visual indicator showing the level of a liquid contained within the tank, and the level indicator of the graphical element rises and falls in response to a stream of data supplied by a tank level sensor indicative of the liquid level within the tank. Animated graphical images driven by constantly changing process data values within data streams, of which the tank level indicator is only one example, are considerably easier for a human observer to comprehend than a stream of numbers. For this reason process visualization systems, such as IN-TOUCH, have become essential components of supervisory process control and manufacturing information systems.

Loss of data access to a process control system essentially blinds the HMI systems, and thus human managers, to the current status of a process control system. Therefore, maintaining reliable uninterrupted access by the above-described HMI systems to process control elements is very important, if not essential, to the overall viability of a supervisory process control system. As a result, many systems incorporate redundancy, and an automated fail-over mechanism, into their data/control paths to ensure that human access to an automated process control system is not disrupted due to a single path/machine failure.

Such redundancy/fail-over functionality has been implemented in systems wherein duplicate components operate in parallel on separate machines in a same network area. In one redundant data delivery host implementation, a second data delivery host system operates as an equivalent copy of the primary data delivery host system. Such an implementation required duplicated communications, hardware, and software. Furthermore, the redundancy was not transparent to the clients of the data delivery system. As a result, each of the clients of the redundant data delivery system was required to be aware of the distinctly identified/named active and standby systems. Configuring/implementing/relocating redundant hosts in such systems substantially increases the cost of the system and the networks within which such systems operate.

SUMMARY OF THE INVENTION

The present invention addresses the potential need to provide better ways of implementing redundancy in hosts (e.g., data/message delivery servers/services) residing and operating within a supervisory process control environment supporting, by way of example, visualization applications for monitoring and managing elements of controlled industrial processes. The present invention facilitates configuring and deploying a redundant host pair in a supervisory process control and manufacturing information system wherein specified ones of the redundant host pair have equivalent capabilities, but function differently in accordance with distinct roles taken by the partners of the redundant pair in a runtime environment.

A redundant host pair runtime arrangement is disclosed for a process control network environment. The arrangement includes a primary network. A first partner of a fail-over host pair operates on a first machine communicatively connected to the primary network. The first partner hosts a set of executing application components in accordance with an active role assigned to the first partner. A second partner of the fail-over host pair operates on a second machine communicatively connected to the primary network. The second partner hosts a non-executing version of the set of executing application components in accordance with a standby runtime role. A monitoring process, operating separately upon the first machine, senses a failure of the first partner, and in response, initiates a notification to the second partner to take over the active role.

Other inventive aspects of the systems and methods disclosed herein address the configuration of such systems as well as their runtime behavior, including the content of the synchronization information passed between the fail-over pair via the redundancy message channel.

BRIEF DESCRIPTION OF THE DRAWINGS

While the appended claims set forth the features of the present invention with particularity, the invention, together with its objects and advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram depicting the hosting/hierarchical relationships of components within an exemplary supervisory process control network including a multi-layered supervisory process control and manufacturing information system;

FIG. 2 depicts a multi-tiered object hosting arrangement for hosting applications on platforms and engines within an exemplary system embodying the present invention;

FIG. 3 is a flowchart summarizing a set of exemplary steps for configuring and deploying a redundant host, and more particularly an application engine that hosts a set of application objects;

FIG. 4 is an exemplary user interface associated with configuring a redundancy capable host/application engine;

FIG. 5 is an exemplary user interface associated with deploying a node for hosting a backup application engine;

FIG. 6 is an exemplary user interface associated with configuring a redundancy message channel (IP address of the network interface card) on a node hosting a backup partner of a fail-over engine pair;

FIG. 7 is an exemplary user interface associated with deploying a configured fail-over engine pair;

FIG. 8 is a flowchart including an exemplary set of steps summarizing the deployment of a fail-over enabled engine pair to their respective hosts;

FIG. 9 is a state diagram summarizing an exemplary set of states and transitions for a state machine embodying the operation of a fail-over engine partner;

FIG. 10 is a flowchart summarizing logic performed while a fail-over engine state-machine is within the Standby—Missed Heartbeats state;

FIG. 11 identifies a set of timers associated with monitoring the health of fail-over engine pairs and the networks and nodes through which the fail-over engine pairs communicate;

FIG. 12 is a flowchart summarizing an exemplary set of steps for carrying out fail-over in a redundancy enabled host providing access to real time data, historical data, and alarm data to a set of client/subscribers; and

FIG. 13 comprises an exemplary set of interfaces/methods that support a redundancy fail-over host pair.

DETAILED DESCRIPTION OF THE DRAWINGS

The following description is based on embodiments of the invention and should not be taken as limiting the invention with regard to alternative embodiments that are not explicitly described herein. By way of example, the present invention is incorporated within a supervisory process control and manufacturing information environment wherein individual data sources are represented by application objects. An example of such a system is described in detail in Resnick et al., U.S. application Ser. No. 10/179,668 filed on Jun. 24, 2002, for SUPERVISORY PROCESS CONTROL AND MANUFACTURING INFORMATION SYSTEM APPLICATION HAVING A LAYERED ARCHITECTURE, the contents of which are incorporated herein by reference in their entirety including the contents and teachings of any references identified/contained therein. However, as those skilled in the art will appreciate in view of the disclosed exemplary embodiments, the present invention is potentially applicable to a variety of alternative supervisory process control environments that include identifiable data sources that provide real-time process data that drives a set of dynamic graphical elements representing at least a portion of an observed/controlled industrial process.

Referring to FIG. 1, a schematic diagram depicts the hosting/hierarchical relationships of components within an exemplary supervisory process control network including a multi-layered supervisory process control and manufacturing information system that incorporates fail-over engine pairs. Before going into a more detailed description of the exemplary network environment it is generally noted that, in this embodiment, data sources are presented, by way of example, in the form of application objects 105 and application objects' 107 that receive status information. Furthermore, the application objects 105 and application objects' 107 are identified within a global name table 125 maintained by a configuration database 124 (e.g., Wonderware's Galaxy Repository), the contents of which are made available to a developer via a visualization application development tool 127 (e.g., Wonderware's INTOUCH software) executing on a configuration PC 120. The visualization application development tool 127, in an embodiment of the present invention, submits queries for particular information residing within the configuration database to facilitate presenting available data sources (e.g., application objects 105) incorporated by a developer into one or more process visualization view/windows for a particular application (e.g., a manufacturing process line). Once built, the process visualization application is potentially executed upon any one of a set of workstations connected to the supervisory process control network schematically depicted in FIG. 1.

With continued reference to FIG. 1, a first application server personal computer (PC) 100 and a second application server PC 102 collectively and cooperatively execute a redundant distributed multi-layered supervisory process control and manufacturing information application comprising a first portion 104 and second portion 106. The application portions 104 and 106 include device integration application objects PLC1Network and PLC1, and PLC1Network′ and PLC1′, respectively. The PLCxNetwork device integration objects facilitate configuration of a data access server (e.g., OPC DAServers 116 and 118). The PLC1 and PLC1′ device integration objects, operating as OPC clients, access data locations within the buffers of the OPC DAServers 116 and 118. The data access servers 116 and 118 and the device integration objects cooperatively import and buffer data from external process control components such as PLCs or other field devices.

In an embodiment of the invention, requests for plant floor information that drives graphical displays representing the plant floor equipment status are submitted by human-machine interface software executing upon PCs (e.g., PC 120) connected to network 119. The data buffers of the data access servers 116 and 118 are accessed by a variety of application objects 105 and 107 executing upon the personal computers 100 and 102. Examples of application objects include discrete devices, analog devices, field references, etc. In the illustrative example, requests for plant floor information and responsive data are passed between the PCs 100 and 102 (on the plant floor) and PC 120 via the network 119.

In accordance with an embodiment of the present invention, application engines host the application objects (via a logical grouping object referred to herein as an “area”). The engines are in turn hosted by platform objects at the next lower level of the supervisory process control and manufacturing information application. The application portions 104 and 106 are, in turn, hosted by generic bootstrap components 108 and 110. All of the aforementioned components are described herein below with reference to FIG. 2.

In the exemplary system embodying the present invention, the multi-layered application comprising portions 104 and 106 is communicatively linked to a controlled process. In particular, the first application server personal computer 100 and the second application server personal computer 102 are communicatively coupled to a first programmable logic controller 112 via a plant floor network 115. It is noted that the depicted connections from the PCs 100 and 102 to the PLC 112 via plant floor network 115 represent logical connections. Such logical connections correspond to both direct and indirect physical communication links. For example, in a particular embodiment, the PLC 112 comprises a node on an Ethernet LAN to which the personal computers 100 and 102 are also connected. In other embodiments, the PLC 112 is linked directly to physical communication ports on the PCs 100 and 102.

In the illustrative embodiment set forth in FIG. 1, the PCs 100 and 102 execute data access servers 116 and 118 respectively. The data access servers 116 and 118 obtain/extract process information provided by the PLC 112 and provide the process information to application objects (e.g., PLC1Network, PLC1, PLC1Network′, PLC1′) of the application comprising portions 104 and 106. The data access servers 116 and 118 are, by way of example, OPC Servers. However, those skilled in the art will readily appreciate the wide variety of custom and standardized data formats/protocols that are potentially carried out by the data access servers 116 and 118. Furthermore, the exemplary application objects, through connections to the data access servers 116 and 118, represent a PLC network and the operation of the PLC itself. However, the application objects comprise a virtually limitless spectrum of classes of executable objects that perform desired supervisory control and data acquisition/integration functions in the context of the supervisory process control and manufacturing information application.

The supervisory process control and management information application is augmented, for example, by the configuration personal computer 120 that executes a database (e.g., SQL) server 122 that maintains a supervisory process control and management information application configuration database 124 for the application objects and other related information including templates from which the application objects are instantiated. The configuration database 124 also includes a global name table 125 that facilitates binding location independent object names to location-derived handles facilitating routing messages between objects within the system depicted in FIG. 1. The configuration PC 120 and associated database server 122 support: administrative monitoring for a multi-user environment, revision history management, centralized license management, centralized object deployment including deployment and installation of new objects and their associated software, maintenance of the global name table 125, and importing/exporting object templates and instances.

Configuration of the applications, including the creation and deployment of fail-over application engines (discussed further herein below), is carried out via an Integrated Development Environment (IDE) 126. The IDE 126 is a utility (comprising potentially multiple components) from which process control and manufacturing information applications, including application objects and engines, are defined, created and deployed to a variety of platforms/engines including, for example, the application server PCs 100 and 102. Developers of a supervisory process control and manufacturing information application, through the IDE 126, carry out a wide variety of application design functions including: importing new object and template types, configuring new templates from existing templates, defining new application objects, and deploying the application objects to the host application engines (e.g., AppEngine1 on the application server PC 100).

The exemplary supervisory control network environment depicted in FIG. 1 also includes a set of operator stations 130, 132, and 134, connected to network 119, that provide a view into a process or portion thereof, monitored/controlled by the supervisory process control and management information application installed and executing as a set of layered objects upon the PCs 100 and 102. A RawMaterial PC 130 provides a representative view enabling monitoring a raw materials area of a supervised industrial process. A ProductionPC 132 presents a representative view of a production portion of the supervised industrial process. A FinishedProductPC 134 provides a representative view of an area of a production facility associated with finished product. Each one of the operator stations 130, 132, and 134 includes a bootstrap host for each of the particular operator station platforms. Each one of the operator stations 130, 132, and 134 includes a view engine that processes graphics information to render a graphical depiction of the observed industrial process or portion thereof.

In an embodiment of the present invention, PC 102 provides fail-over support for PC 100. By way of example, fail-over support occurs at the application engine level (e.g., AppEngine 1 and AppEngine 1′). Thus, when AppEngine 1 on PC 100 fails/shuts down, AppEngine 1′ (having a same assigned reference name as AppEngine 1 in the global name table 125) on PC 102 is configured to take over responsibilities (e.g., hosting application objects) previously assigned to AppEngine 1. Fail-over support on the application engine level provides high availability for application objects, hosted by a fail-over enabled engine pair configuration, across a runtime failure of a currently active engine of the fail-over engine pair. An application engine, in an embodiment of the invention, is enabled/designated for fail-over during a configuration stage. During configuration, only a primary engine is configurable (e.g., application objects are assigned to the primary engine). After a fail-over enabled application engine is checked in, primary and backup application engines of an application engine fail-over pair are deployed to a first platform and second platform (residing on distinct networked machines). In a runtime environment, the primary engine is generally assigned an active role of the fail-over enabled application engine pair and therefore starts up/hosts/executes a set of hosted application objects.

On the other hand, an application engine assigned a backup role during configuration/deployment provides redundancy support for the fail-over pair. The backup engine, generally assigned a standby role of a fail-over enabled engine pair at runtime, ensures a high degree of availability of at least one engine of the application engine pair and hosted application objects. The backup engine is created when a fail-over configured application engine is checked in. The backup engine, and its analogous standby engine at runtime, contains the necessary components (e.g., software and data) for creating/hosting application object instances that are associated with the fail-over enabled application engine. However, in an embodiment of the invention, the application objects are neither started up nor executed on the backup engine of the fail-over engine pair. During runtime the standby engine of a fail-over enabled application engine monitors the status of the primary engine and checkpoints critical data in contemplation of taking over executing the application objects hosted by the fail-over enabled application engine pair in the event that the active engine ceases to operate. Upon detection of a failure of the current active application engine, the standby engine (e.g., AppEngine 1′ on PC 102) becomes the active engine and performs the tasks associated with hosting the application objects on the fail-over enabled application engine pair. In particular, upon taking on the active engine role, the now active engine invokes startup methods on the hosted application objects and commences execution of the application objects in place of the failed partner of the fail-over enabled application engine pair. By way of example, when the standby engine acquires the role of active engine, it takes over responsibility for references that facilitate modifying attributes, monitoring changes to attributes, and retrieving data from an attribute. Such references are associated with supervisory, user and system reference sets associated with the hosted application objects.
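
The role change described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names Role, AppObject, Engine, and on_partner_failed are hypothetical, and failure detection itself is assumed to happen elsewhere. The sketch only shows that the standby partner holds its application objects without running them and invokes their startup methods upon acquiring the active role.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


@dataclass
class AppObject:
    """Hypothetical stand-in for a hosted application object."""
    name: str
    running: bool = False

    def startup(self) -> None:
        self.running = True


@dataclass
class Engine:
    """Hypothetical fail-over engine partner (illustrative only)."""
    global_name: str                 # name shared by both partners of the pair
    host: str                        # machine on which this partner resides
    role: Role
    objects: list[AppObject] = field(default_factory=list)

    def on_partner_failed(self) -> None:
        """A standby partner takes the active role and starts its objects."""
        if self.role is not Role.STANDBY:
            return                   # an active partner has nothing to take over
        self.role = Role.ACTIVE
        for obj in self.objects:
            obj.startup()


standby = Engine("AppEngine1", "PC102", Role.STANDBY,
                 [AppObject("tank1"), AppObject("pump1")])
standby.on_partner_failed()          # objects start only after the role change
```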

An aspect of the fail-over enabled application engine pair disclosed herein is the relative transparency of the backup engine and the standby engine. In an embodiment of the invention, a user designates a host for a backup engine. However, deploying a backup engine is performed automatically, without intervention by a user. A user generally implements control/configuration of the fail-over enabled application engine through operations on the primary/active engine. Furthermore, the active and standby engines share a single global name within a supervisory control system runtime environment. Thus, in the event of fail-over to the standby application engine, there is no need to change any references used to identify the fail-over enabled application engine pair. Though access to hosted application objects may be temporarily lost during fail-over (while the standby engine acquires the active role and starts up hosted application objects/primitives), clients are unaware of the switch to the standby (now active) application engine and continue using a same set of global references to access the resources supported by the fail-over enabled engine pair, even though the physical location of the responsive application objects has changed.
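
The transparency of the shared global name can be sketched as follows. This is a simplified illustration under stated assumptions: the GlobalNameTable class and its bind and resolve methods are hypothetical, and a plain host string stands in for the location-derived handle. The client keeps using the same name before and after fail-over; only the binding behind the name changes.

```python
class GlobalNameTable:
    """Hypothetical name table mapping a global name to its current location."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def bind(self, global_name: str, host: str) -> None:
        # Rebinding an existing name models the standby partner becoming active.
        self._locations[global_name] = host

    def resolve(self, global_name: str) -> str:
        return self._locations[global_name]


table = GlobalNameTable()
table.bind("AppEngine1", "PC100")            # primary partner is active
before = table.resolve("AppEngine1")

table.bind("AppEngine1", "PC102")            # fail-over: same name, new location
after = table.resolve("AppEngine1")
```

The design point is that the reference held by the client ("AppEngine1" here) never changes, so no client reconfiguration is needed after the role switch.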

In accordance with an embodiment of the present invention, the fail-over enabled application engine pair perform synchronization operations to facilitate a change in role of the standby engine to active engine status. Examples of synchronized data include: checkpoint files (including configuration/tuning values, alarm limits, and deployed objects on the active engine), alarm states (time stamped), subscriber lists (to data provided by hosted objects), live data, and data within a store and forward buffer (to be passed, for example, to a process status history database). Once initially loaded, the active engine tracks changes to synchronized information (e.g., checkpoint deltas) and sends only the changes (as opposed to passing complete copies of the synchronized information). Sending only changes significantly reduces the volume of traffic over a link 140 (described further herein below). This is especially important since embodiments of the invention contemplate a single PC (e.g., PC 102) hosting multiple instances of either active or standby engines. In the case where multiple application engines are configured as fail-over pairs on two PCs (e.g., PC 100 and PC 102), the link 140 is shared by all the fail-over engines to carry out communications relating to their fail-over functionality.
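
The idea of sending only changes can be sketched in Python. The sketch assumes the synchronized information can be modeled as a flat dictionary of named values; the DeltaTracker class, the apply_delta function, and the delta message layout are all hypothetical and are not taken from the disclosure.

```python
_MISSING = object()   # sentinel for keys the standby has never received


class DeltaTracker:
    """Active-side helper remembering what the standby already holds."""

    def __init__(self) -> None:
        self._sent: dict = {}

    def delta(self, current: dict) -> dict:
        """Return only the entries that changed since the previous call."""
        changed = {k: v for k, v in current.items()
                   if self._sent.get(k, _MISSING) != v}
        removed = [k for k in self._sent if k not in current]
        self._sent = dict(current)
        return {"set": changed, "removed": removed}


def apply_delta(replica: dict, delta: dict) -> None:
    """Standby-side helper merging a delta into its local copy."""
    replica.update(delta["set"])
    for key in delta["removed"]:
        replica.pop(key, None)


tracker = DeltaTracker()
replica: dict = {}

state = {"tank1.hi_limit": 90.0, "tank1.lo_limit": 10.0}
first = tracker.delta(state)        # initial load: every entry is sent
apply_delta(replica, first)

state["tank1.hi_limit"] = 85.0      # a single tuning value changes
second = tracker.delta(state)       # only the changed entry is sent
apply_delta(replica, second)
```

After the second call, the message contains one entry rather than the whole checkpoint, which is the traffic reduction the paragraph above describes.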

By way of example, checkpoint data is passed from the PC 100 (running the primary engine) to PC 102 (containing the backup engine) via the link 140 referred to herein as a redundancy message channel (RMC). The link 140 (e.g., an Ethernet link, an 802.11x wireless link, etc.) is physically separate and distinct from the plant floor network 115 and supports transferring essential information between PC 100 and PC 102 at high data rates to implement a fail-over/backup functionality. In an embodiment of the invention, a fail-over enabled engine (e.g., AppEngine 1 on PC 100) includes a system attribute (remote partner address or “RPA”) that facilitates specifying an Internet Protocol address of a network interface associated with the backup engine side of link 140. On startup the primary engine (e.g., AppEngine 1) utilizes the RPA attribute to send a message to a specified host name or IP address to initially contact the platform that hosts its fail-over engine partner (e.g., AppEngine 1′) via the Redundancy Message Channel (RMC), represented in FIG. 1 by link 140. This initial message informs the backup/standby engine (or any other interested entity including the platform host for the backup/standby engine) of the IP address of the primary engine's host platform. In an embodiment of the invention, the RPA is calculated after a node/platform for the backup engine is specified. Thus, the RPA is potentially designated during a configuration stage or during a deployment stage where the fail-over enabled configuration is loaded onto specified platforms on a network. In an exemplary embodiment, a single RPA is assigned to a physical network interface for a platform (PC) that potentially hosts multiple application engines. However, distinct references (e.g., handles, names, etc.) are assigned to each fail-over application engine to distinguish multiple application engines hosted by a single platform.

It is noted that the system depicted in FIG. 1 and described hereinabove is merely an example of a multi-layered hierarchical architecture for a supervisory process control and manufacturing information system including redundant/fail-over application servers for ensuring the continuous supply of data from a plant floor network 115 to human-machine interface computers on the network 119. The present invention is not limited to the particular disclosed application/system, and in fact, need not be implemented in the form of a multi-leveled application as shown in the illustrative example. It is further noted that FIG. 1 is presented as a logical view of the hosting and/or containment interrelations between installed components including software and physical computing hardware. The present invention is suitable for virtually any network topology. For example, the present invention is applicable to a system wherein both configuration utility and supervisory process control visualization applications run on a single computer system linked to a controlled process.

Turning to FIG. 2, a class diagram depicts the hierarchical hosting arrangement of layered software associated with a computer (e.g., PCs 100 or 102) executing at least a portion of a supervisory process control and manufacturing information application. Each computer executes an operating system 200, such as MICROSOFT's WINDOWS, at a lowest level of the hierarchy. The operating system 200 hosts a bootstrap object 202. The bootstrap object 202 is loaded onto a computer and activated in association with startup procedures executed by the operating system 200. As the host of a platform class object 204, the bootstrap object 202 must be activated before initiating operation of the platform class object 204. The bootstrap object 202 starts and stops the platform class object 204. The bootstrap object 202 also renders services utilized by the platform class object 204 to start and stop one or more engine objects 206 hosted by the platform class object 204.

The platform class object 204 is host to one or more engine objects 206. In an embodiment of the invention, the platform class object 204 represents, to the one or more engine objects 206, a computer executing a particular operating system. The platform class object 204 maintains a list of the engine objects 206 deployed on the platform class object 204, starts and stops the engine objects 206, and restarts the engine objects 206 if they crash. The platform class object 204 monitors the running state of the engine objects 206 and publishes the state information to clients. The platform class object 204 includes a system management console diagnostic utility that enables performing diagnostic and administrative tasks on the computer system executing the platform class object 204. The platform class object 204 also provides alarms to a distributed alarm subsystem.

The engine objects 206 host a set of application objects 210 that implement supervisory process control and/or manufacturing information acquisition functions associated with an application. The engine objects 206 initiate startup of all application objects 210. The engine objects 206 also schedule execution of the application objects 210 with regard to one another with the help of a scheduler object 208. Engine objects 206 register application objects 210 with the scheduler object 208 for execution. The scheduler object 208 executes application objects relative to other application objects based upon a configuration specified by a corresponding one of the engine objects 206. The engine objects 206 monitor the operation of the application objects 210 and place malfunctioning ones in a quarantined state. The engine objects 206 support checkpointing by saving/restoring changes to a runtime application made by automation objects to a configuration file. The engine objects 206 maintain a name binding service that binds attribute references (e.g., tank1.value.pv) to a proper one of the application objects 210.
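
A name binding service of the kind mentioned above might be sketched as follows. The sketch assumes, purely for illustration, that the first dot-separated component of a reference names the application object and that the remainder is an attribute path; the disclosure does not state the parsing rule, and the resolve_reference function is hypothetical.

```python
def resolve_reference(objects: dict, reference: str) -> tuple:
    """Bind an attribute reference to (application object, attribute path).

    Assumption: the text before the first '.' is the object name.
    Raises KeyError when no hosted object carries that name.
    """
    object_name, _, attribute_path = reference.partition(".")
    return objects[object_name], attribute_path


# Hypothetical hosted objects keyed by name; plain dicts stand in for objects.
hosted = {"tank1": {"value.pv": 42.5}, "pump1": {"state": "on"}}

target, path = resolve_reference(hosted, "tank1.value.pv")
level = target[path]
```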

The engine objects 206 ultimately control how execution of associated ones of the application objects 210 will occur. However, once the engine objects 206 determine execution scheduling for application objects 210, the real-time scheduling of their execution is controlled by the scheduler 208. The scheduler 208 supports an interface containing the methods RegisterAutomationObject( ) and UnregisterAutomationObject( ) enabling engine objects 206 to add/remove particular ones of the application objects to/from the scheduler 208's list of scheduled operations.
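By way of a non-limiting illustration, the register/unregister interface of the scheduler 208 can be sketched as follows. The class names, the Python method names (which merely mirror RegisterAutomationObject( ) and UnregisterAutomationObject( )), and the registration-order execution policy are assumptions made for the sketch and are not taken from the described embodiment.

```python
class Scheduler:
    """Hypothetical scheduler holding a list of scheduled application objects."""

    def __init__(self):
        # Assumption: objects execute in the order in which they were registered.
        self._scheduled = []

    def register_automation_object(self, obj):
        # Mirrors RegisterAutomationObject(): add the object if not yet scheduled.
        if obj not in self._scheduled:
            self._scheduled.append(obj)

    def unregister_automation_object(self, obj):
        # Mirrors UnregisterAutomationObject(): remove the object if present.
        if obj in self._scheduled:
            self._scheduled.remove(obj)

    def run_scan(self):
        # Execute every scheduled object once and collect the results.
        return [obj.execute() for obj in self._scheduled]


class AppObject:
    """Hypothetical application object exposing an execute method."""

    def __init__(self, name):
        self.name = name

    def execute(self):
        return f"{self.name} executed"


scheduler = Scheduler()
pump, valve = AppObject("pump"), AppObject("valve")
scheduler.register_automation_object(pump)
scheduler.register_automation_object(valve)
scheduler.unregister_automation_object(pump)
scan_result = scheduler.run_scan()
```

In this sketch an engine decides which objects are scheduled, while the scheduler alone drives their execution during a scan.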

The application objects 210 include a wide variety of objects that execute business logic facilitating carrying out a particular process control operation (e.g., turning a pump on, actuating a valve), and/or information gathering/management function (e.g., raising an alarm based upon a received field device output signal value) in the context of, for example, an industrial process control system. Examples of process control (automation) application objects include analog input, discrete device, and PID loop objects. A class of the application objects 210 acts upon data supplied by process control systems, such as PLCs, via device integration objects (e.g., OPC DAServer 118). The function of the integration objects is to provide a bridge between process control/manufacturing information sources and the supervisory process control and manufacturing information application.

The application objects 210, in an exemplary embodiment, include an application interface accessed by the engine objects 206 and the scheduler 208. The engine objects 206 access the application object interface to initialize an application object, startup an application object, and shutdown an application object. The scheduler 208 uses the application object interface to initiate a scheduled execution of a corresponding application object.

Having described the primary components of an exemplary supervisory process control and manufacturing information network environment, attention is directed to an exemplary set of steps summarized in FIG. 3 that are interactively performed, in part, via a supervisory process and manufacturing information system component configuration utility such as the previously mentioned IDE 126. In the illustrative example, the configuration utility comprises a graphical user interface that exposes a set of parameters associated with defining and deploying a redundant/fail-over enabled host, and in particular a fail-over enabled application engine pair. The parameter values specified by a user through the interface are utilized during later deployment (or redeployment) of the fail-over host/application engine pair. It is noted that while the illustrative example is directed to an application engine, the present invention is potentially applicable to a variety of host objects, and seeks to provide a streamlined and user-friendly way of configuring redundancy in a system and ensure backup availability of host components in a supervisory process control and manufacturing information system. Furthermore, the ordering of the steps is intended to be exemplary. Those skilled in the art will readily appreciate the ability to modify the order of completing various stages described herein below in accordance with alternative embodiments of the invention.

Step 300: Enabling Fail-Over for an Application Engine During Configuration

Initially, during step 300 a user enables and customizes fail-over behavior for a selected application engine object. The selections/values designated for the application engine during step 300 are registered by the configuration utility (e.g., IDE 126) for later use when the application engine configuration selections are checked in and deployed. Referring to FIG. 4, application engine fail-over behavior is enabled and customized, by way of example, through a set of values submitted by a user via a redundancy properties interface generated by the configuration utility.

In the illustrative example, the configuration utility user interface presents a number of tabs relating to configuration of the application engine 402 (selected in the deployment view area of the configuration user interface of FIG. 4). A user selects a Redundancy tab 400 on the configuration utility interface to expose a set of parameters, depicted in a properties view 401, associated with defining redundancy/fail-over behavior for a currently selected application engine (AppEngine_001) 402. In an embodiment of the invention, a user designates redundancy for the selected application engine 402 by “checking” an Enable redundancy checkbox 404. In response to the fail-over designation, a fail-over dynamic primitive is added to the application engine object and the engine is designated as the primary engine of a fail-over pair. While not shown in FIG. 4, the backup engine for application engine 402 is initially assigned in the deployment view to the unassigned host 405. A user thereafter re-assigns the backup engine (via drag and drop) to an actual platform node depicted in the deployment view. After the configuration of the application engine 402 is saved/checked in (releasing an editing lock on the object) during step 360 (described herein below) and validated by calling a validate method on the object, a backup engine object is created by a utility that manages objects within the system.

The illustrative fail-over configuration interface set forth in FIG. 4 also supports a set of user-specified parameters defining the fail-over behavior of the application engine 402. A forced fail-over timeout 406 enables a user to designate a period of time that a currently active application engine is given to execute a user-initiated fail-over to a standby application engine that otherwise waits in a standby state. A maximum checkpoint deltas buffered 408 enables a user to specify a maximum number of checkpoint delta packages that will be buffered before initiating a full re-synchronization of the checkpointed information. A typical value for the maximum checkpoint deltas buffered 408 is zero (when there is plenty of bandwidth to transfer the checkpoint delta packages to the standby engine during a scan cycle); a non-zero value is used to handle exceptional cases such as a slow synchronization link. A maximum alarm state changes buffered 410 enables a user to specify the maximum number of alarm state change packages that will be buffered before the active application engine will initiate a complete re-synchronization of the alarm states.
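One possible reading of the maximum checkpoint deltas buffered 408 parameter is sketched below. The class and attribute names are hypothetical, and the sketch assumes that a delta package is handed to the buffer only when it could not be transferred to the standby engine immediately; the described embodiment does not specify these details.

```python
from collections import deque


class CheckpointBuffer:
    """Hypothetical buffer for delta packages awaiting transfer to the standby."""

    def __init__(self, max_deltas_buffered):
        self.max_deltas_buffered = max_deltas_buffered
        self.pending = deque()
        self.full_resync_needed = False

    def add_delta(self, delta):
        # A delta arrives that could not be sent right away. If buffering it
        # would exceed the configured maximum, drop the backlog and request a
        # full re-synchronization instead.
        if len(self.pending) >= self.max_deltas_buffered:
            self.pending.clear()
            self.full_resync_needed = True
        else:
            self.pending.append(delta)


slow_link = CheckpointBuffer(max_deltas_buffered=2)
for package in ("d1", "d2", "d3"):
    slow_link.add_delta(package)

fast_link = CheckpointBuffer(max_deltas_buffered=0)
fast_link.add_delta("d1")
```

With a limit of zero, any delta that cannot be sent within the scan cycle immediately triggers a full re-synchronization, which is consistent with zero being the typical setting on a link with ample bandwidth.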

The redundancy/fail-over parameters exposed by the exemplary configuration user interface include a set of parameters relating to heartbeats transmitted/broadcast by the active and standby application engines to other system components. The heartbeats are periodic transmissions, to which recipients need not respond, that provide assurance that the heartbeat sender is operational. A standby engine heartbeat period 412 and an active engine heartbeat period 414 specify periods between transmissions of heartbeat messages by each of the two engine role types. A maximum consecutive heartbeats missed from active engine 416 and a maximum consecutive heartbeats missed from standby engine 418 specify a number of consecutive elapsed heartbeat periods that are registered by a listener (i.e., intended recipient of the heartbeat transmissions) before registering a fail-over pair communication failure. Such failures are potentially handled by supervisory scripts that perform any one of a variety of operations including, by way of example, generating a warning/alarm message to a monitor, initiating fail-over to a standby partner engine, and re-deploying (automatically or upon direction from a user) the non-responding fail-over engine partner. The use of heartbeats in a fail-over scheme is discussed further herein below.
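The relationship between a heartbeat period and a maximum consecutive heartbeats missed value can be illustrated with the following minimal sketch of a listener. The class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the sketch assumes that a failure is registered once the number of elapsed periods reaches the configured maximum.

```python
class HeartbeatListener:
    """Hypothetical listener that counts heartbeat periods elapsed in silence."""

    def __init__(self, period_s, max_missed):
        self.period_s = period_s      # sender's heartbeat period, in seconds
        self.max_missed = max_missed  # consecutive misses tolerated
        self.last_heartbeat = 0.0

    def on_heartbeat(self, now):
        # Heartbeats need no reply; the listener only records the arrival time.
        self.last_heartbeat = now

    def missed_periods(self, now):
        # Whole heartbeat periods that have elapsed since the last heartbeat.
        return int((now - self.last_heartbeat) // self.period_s)

    def partner_failed(self, now):
        # Assumption: failure is registered when the count reaches the limit.
        return self.missed_periods(now) >= self.max_missed


listener = HeartbeatListener(period_s=1.0, max_missed=3)
listener.on_heartbeat(now=10.0)
still_ok = listener.partner_failed(now=12.5)
failed = listener.partner_failed(now=13.0)
```

A supervisory script could poll such a listener and, upon a registered failure, raise an alarm or initiate fail-over as described above.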

Transferring responsibilities from an active engine to a standby engine does not commence until the standby engine has become active. If the time delay between when a client engine becomes aware of the primary/active engine's failure and when the client engine receives notification that the backup/standby has become active exceeds a configured limit, then the quality of all references associated with the failed engine is set to uncertain. The configured time delay limit is specified by a user via a maximum time to maintain good quality after failure parameter 420. Yet another parameter, a maximum time to discover partner 422, enables a user to specify how long the primary engine waits for a response from its backup engine, after issuing a connection request via the RMC, before registering a failure. A force fail-over command 424 enables a user to specify an alphanumeric string that, when provided by a supervisor/administrator, will force transfer of active status from the currently active engine to the current standby engine without waiting for the currently active engine to fail.
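The effect of the maximum time to maintain good quality after failure parameter 420 on a client engine can be sketched as follows. The function name, the quality labels as strings, and the use of seconds are assumptions made for illustration only.

```python
def quality_while_waiting(failure_noticed_at, now, max_good_quality_s):
    """Quality a client assigns to references on a failed engine while it is
    still waiting for notice that the standby engine has become active."""
    elapsed = now - failure_noticed_at
    # Quality stays good until the configured limit is exceeded.
    return "good" if elapsed <= max_good_quality_s else "uncertain"


within_limit = quality_while_waiting(failure_noticed_at=100.0, now=104.0,
                                     max_good_quality_s=5.0)
past_limit = quality_while_waiting(failure_noticed_at=100.0, now=106.0,
                                   max_good_quality_s=5.0)
```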

Steps 310 and 320: Configuration of a New Platform Host for the Backup Engine

With continued reference to the illustrative example set forth in FIG. 4, the application engine 402 and its backup engine must be deployed to separate platforms/nodes. If, at step 310, a platform for hosting the backup of the application engine 402 (on the platform identified in the deployment view as “Node_A”) does not yet exist, then control passes to step 320 wherein a platform is configured/created to host the backup engine for application engine 402. As indicated by a tree structure 403 (depicting a configured physical deployment view of application components in a system including multiple networked computing nodes), a second physical networked computing device node/platform object does not yet exist for hosting a backup application engine for the application engine 402 deployed to a platform object identified in the tree structure 403 as “Node_A”. Therefore, during step 320 a user creates a new node/platform, by dragging and dropping a copy of a $WinPlatform template 407 from a template toolbar tree into the deploy view area.

Turning briefly to FIG. 5, an exemplary deployment view depicts a redundant engine pair configuration after a user has created a new node/platform (Node_B) to host the backup engine for application engine 402 that resides on Node_A. After creating Node_B, the backup for the application engine (AppEngine_001) 402 is placed upon Node_B by dragging and dropping “AppEngine_001 (Backup)” from the Unassigned Host directory to the Node_B platform on the depicted Deployment view tree. Node_B will, as depicted in FIG. 5, host the backup (AppEngine_001 (Backup)) for the application engine (AppEngine_001) 402 on Node_A. Upon completing creating/configuring a new platform to host the backup application engine, control passes from step 320 to step 330.

On the other hand, if the host platform for the backup engine already exists, then control passes directly from step 310 to step 330.

It is noted that creating application components (e.g., a node/platform, an engine, an application object, etc.) in the deployment view of a configuration environment is a distinct operation from “deploying” components to physical computing machines within a network. With continued reference to FIG. 5, an “Object” menu 500 includes a “deploy” option 502 for carrying out the actual deployment of one or more selected components from the deployment view. When the “deploy” option is selected in conjunction with a previously selected “Node_A”, a platform, corresponding to Node_A in the deployment view, and all components under Node_A, are installed upon a networked computing machine corresponding to Node_A. Such deployment of application components is described further herein below.

Steps 330/340: Configuring the RMC on the Backup Platform

In addition to a backup engine host, a fail-over application engine pair also relies upon a fail-over communications link, and in particular a redundancy message channel (RMC). The RMC provides a communications path between host platforms of fail-over partners through which the primary and backup engines exchange information including, by way of example, checkpoint, status, and command/control information. Each host platform on the RMC is assigned a unique physical network address. In an illustrative embodiment, the RMC utilizes a network path between PCs that is physically separate from a primary general network path utilized by the host PCs for a variety of other purposes. By way of example, the RMC utilizes link 140 (e.g., an Ethernet link) that is physically separate from network 119. In an alternative embodiment, the primary general network (e.g., network 119) is utilized. However, using the general network 119 is less desirable in many instances due to the effect of the additional workload associated with the RMC on the performance of network 119.

The RMC is potentially used by multiple fail-over pairs for purposes of carrying out fail-over/redundant engine-related communications. In one example of using the RMC to handle multiple fail-over pairs, sharing of the link 140 is contemplated to facilitate an “N on 1” fail-over configuration wherein a single platform hosts the backup counterpart for a set of N primary application engines configured for fail-over. In fact, the primary application engines need not be present on the same host PC. Instead, a single platform (e.g., ApplicationServer2PCPlatform) potentially hosts backup engines for multiple primary engines with different host PCs. In such instance, the link 140, by way of example, comprises a multi-drop network bus and each platform hosting a primary or backup engine shares a common network (corresponding to link 140) for their RMC. Workload is balanced to ensure that, in the event of multiple fail-overs, activating multiple standby engines on a single platform does not cause scan overruns on the host of the standby engines when they assume the active engine role. Such contingent behavior is potentially handled by executing a supervisory script upon the platform hosting the fail-over backup engines to monitor workload and relocate backup engines to other available platforms. Relocating the backup engines in response to detected load avoids overloading a platform (computing device/node) that, as a consequence of multiple primary/active engine failures, is forced to support multiple active application engines.

Alternatively, in the case where multiple backups are hosted on a single platform host, multiple RMCs (and corresponding network adaptors having distinct network addresses) can be provided for the single platform host such that each fail-over pair is assigned a separate RMC. In yet other embodiments, a combination of dedicated and shared RMCs is supported by a single platform host.

With continued reference to FIG. 3, during step 330 if an RMC has not yet been set up on the backup host (Node_B), then control passes to step 340. At step 340 the configuration utility presents a user interface that exposes a set of parameters enabling a user to specify a network address corresponding to the backup engine's host platform (Node_B) on the RMC. Referring to FIG. 6, the configuration interface for a platform (e.g., Node_B) includes a set of “Redundancy” configuration fields for specifying the RMC. In particular, a redundancy message channel IP address 600 enables a user to specify a physical (IP) address (e.g., 192.168.001.102) corresponding to the network address/name assigned to the platform (e.g., Node_B) on the RMC link. The value in the redundancy message channel IP address 600 is the RPA for Node_A. Furthermore, the user specifies a redundancy message channel port 602 and a redundancy primary channel port 604. These are the ports for maintaining the heartbeats over the RMC and the primary channel. The RMC IP address 600 has been referred to previously above as the “Remote Partner Address” (RPA). The RPA is utilized by the host of the primary engine, after a fail-over enabled engine pair is checked in and deployed to appropriate platforms, to contact a corresponding backup engine host via the RMC.

In an embodiment of the invention, a message routing service on a platform resolves engine names to addresses. The message routing service executing on the host platform of an engine detects communications across the RMC directed to a corresponding fail-over partner engine and directs the communications to an appropriate engine. Furthermore, the message routing service's ability to distinguish between differing engines (through name resolution operations on their distinct names) on a same RMC facilitates N on 1 fail-over scenarios as well as transparently relocating a fail-over enabled engine to a new platform.
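The name-based dispatch that makes N on 1 configurations possible can be pictured with the following sketch, in which several backup engines on one platform share a single RMC. The class, its methods, and the per-engine inbox are hypothetical constructs for illustration; the actual message routing service is not specified at this level of detail.

```python
class MessageRouter:
    """Hypothetical per-platform router that dispatches RMC messages by engine name."""

    def __init__(self):
        self._inboxes = {}  # engine name -> list of delivered messages

    def register_engine(self, name):
        self._inboxes[name] = []

    def deliver(self, engine_name, message):
        # Resolve the engine name; unknown names are reported, not guessed at.
        inbox = self._inboxes.get(engine_name)
        if inbox is None:
            return False
        inbox.append(message)
        return True

    def inbox(self, engine_name):
        return list(self._inboxes[engine_name])


# One platform hosting the backups of two different primary engines.
router = MessageRouter()
router.register_engine("AppEngine_001 (Backup)")
router.register_engine("AppEngine_002 (Backup)")
delivered = router.deliver("AppEngine_002 (Backup)", "checkpoint delta")
unknown = router.deliver("AppEngine_003 (Backup)", "checkpoint delta")
```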

FIG. 6 includes a set of fields relating to general operations of the Node_B on a primary network (for communicating with a variety of other host nodes). A network address, which can be either a physical (e.g., IP) address or a name, corresponds to the address of Node_B on the primary network. A history store forward directory field specifies a location of store forward data on Node_B (for buffering data for transmission when the primary network is down or too slow to handle Node_B's data transmission flow).

FIG. 6 also includes a set of fields relating to a message exchange service carried out on a primary network to which Node_B is attached. A message timeout value identifies how long Node_B waits for a response before assuming a sent message is lost. An NMX heartbeat period allows for slow networks to avoid timing out when heartbeats are potentially lost/delayed due to a slow link. Consecutive missed heartbeats is a multiplier.

It is noted that a physical address was specified for the RMC of the backup engine host in the example set forth in FIG. 6. However, in an alternative embodiment of the invention, during step 340 a user specifies a host name corresponding to the physical IP address in the RMC IP address 600, and the name is thereafter resolved by a name service to a corresponding physical IP address. After setting up an address on the RMC for the backup engine host, control passes to step 350. On the other hand, if, at step 330, an address on the RMC is already set up for the backup engine host (Node_B), then control passes to step 350.

Step 350 Setting RPA on Primary Engine

During step 350, the configuration of the platform hosting the primary engine (e.g., application engine 402) is supplemented to include the address, on the RMC, of the host platform of the backup engine of application engine 402 (the aforementioned RPA attribute). The RPA attribute facilitates the primary engine initiating a connection with its corresponding backup engine.

Step 360 Checking in Redundant Configuration

Thereafter, during step 360 the application engine, having redundancy enabled, is “checked in” on the configuration database 124. Checking in the application engine releases a locking mechanism that prevents others from changing a checked out application engine while it is, for example, being configured/edited. Checking in an application engine with redundancy enabled also triggers creation of a backup engine instance (assuming one does not currently exist for the particular application engine). Attributes are copied from the primary engine to the newly created engine instance, and a backup “role” attribute is assigned to the new engine instance. The backup role attribute distinguishes the backup engine from its primary engine partner during deployment of the engine partners to their respective platforms during step 370 described herein below. In an exemplary embodiment, the backup engine is initially assigned to a default platform, but can be reassigned via the IDE 126 to another platform. The backup engine is assigned to the same “area” (corresponding to a grouping of closely related components of a process control and manufacturing information system) as the primary engine.

A backup application engine, as a result of copying parameters specified for the primary engine, has the same configuration data as its partner primary engine. Therefore, if a backup engine already exists at the time the primary engine is checked in with redundancy enabled, then the system checks out the backup engine, copies updated configuration data (attributes) from the primary engine to the checked out backup engine, and checks in the modified backup engine. Thus, the backup engine has a copy of the primary engine's configured deployment package.

The configuration information in the backup engine is substantially the same as that in the primary engine. An exception to this general statement is the “remote partner address (RPA)” attribute of the redundancy primitive. The distinct RPA attribute is specified first for the primary engine (during step 350) and later in the backup engine (during step 380) after both the primary and backup engines have been deployed to their respective platforms.

Though not a part of the steps set forth in FIG. 3, a backup engine that has not yet been deployed is deleted when its primary partner is checked in with the redundancy option (e.g., enable redundancy 404) disabled. The removal of the backup engine is broadcast to current clients having references to the redundancy-enabled primary application engine, since the clients potentially have current engine and platform identifications corresponding to the backup engine. In the deployment configuration view of the system, the application engine will no longer visually indicate that it is a primary partner of a fail-over pair. On the other hand, if an application engine is checked in with the redundancy option disabled, and it has a backup engine in a deployed state, then checking in the primary engine will fail. Therefore, prior to removing a backup engine, the backup engine must be un-deployed.
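The check-in rules of step 360 can be summarized in the following sketch. Engines are modeled as plain dictionaries, and the function name, the exception type, the "scan_period_ms" attribute, and the sample address are all hypothetical; only the rules themselves (create or refresh the backup, keep its RPA distinct, delete an un-deployed backup, reject the check-in when the backup is deployed) come from the description above.

```python
import copy


class CheckInError(Exception):
    """Raised when a check-in is rejected."""


def check_in(primary, engines):
    """Check in a primary engine and reconcile its backup engine."""
    backup_name = primary["name"] + " (Backup)"
    backup = engines.get(backup_name)
    if primary["redundancy_enabled"]:
        if backup is None:
            # First check-in with redundancy enabled: create the backup.
            backup = {"name": backup_name, "role": "backup",
                      "deployed": False, "attributes": {}}
            engines[backup_name] = backup
        # Copy the primary's attributes, but keep the backup's own RPA,
        # which is the one attribute that differs between the partners.
        own_rpa = backup["attributes"].get("rpa")
        backup["attributes"] = copy.deepcopy(primary["attributes"])
        backup["attributes"]["rpa"] = own_rpa
    elif backup is not None:
        if backup["deployed"]:
            raise CheckInError("un-deploy the backup engine first")
        del engines[backup_name]  # an un-deployed backup is simply deleted
    engines[primary["name"]] = primary


engines = {}
primary = {"name": "AppEngine_001", "redundancy_enabled": True,
           "attributes": {"scan_period_ms": 500, "rpa": "192.168.1.102"}}
check_in(primary, engines)
backup = engines["AppEngine_001 (Backup)"]
```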

Step 370 Deploying Configured Redundant Engines (and Hosts if Necessary)

With continued reference to FIG. 3, after the redundancy enabled application engine configuration is checked in, during step 370 a user invokes a deploy operation on the configured redundant application engine pair. By way of example, deployment of the redundant application engine configuration package is initiated when a user invokes a global deploy operation by selecting the deploy option on the “Object” menu after selecting a Galaxy containing the application engine (see, e.g., “MyGalaxy” in the deployment tree 700 of FIG. 7). Deploying the redundant application engine pair, which marks a transition from a configuration environment to a runtime environment, includes copying files and information associated with the application engines (including platform files if necessary) to appropriate host machines.

The illustrative example of a fail-over architecture embodying the present invention utilizes a role-based approach to redundancy during configuration, deployment and runtime. Primary/backup roles are initially assigned to redundant application engines during configuration. Turning briefly to FIG. 7, the distinct roles of primary and backup engines are incorporated into a configuration/deployment view of an application engine with redundancy enabled. In particular, for an application engine that is configured to host a set of application objects, an application engine (AppEngine_001) 702 node (representing a primary application engine) enumerates a set of application objects as leaves under the application engine 702 node. In a runtime environment (described herein below) application objects are only executed upon an active application engine (the runtime analog of a primary engine in the configuration/deployment environment). The limited functionality/presence of application objects on a standby engine (the runtime analog of a backup engine in the configuration/deployment environment) is visually represented in FIG. 7 by not displaying application objects under an application engine Backup (AppEngine_001) 704 node in the configuration/deployment view.

Deploying the Fail-Over Engine Pair

Turning to FIG. 8, a set of steps summarize deploying a fail-over enabled application engine pair to their respective hosts during step 370. In the exemplary embodiment, the primary and backup roles established during configuration determine an order of operations when the redundant application engines are deployed to their respective platforms during step 370. When a user requests deploying a primary and backup engine, the system ensures a primary engine is fully deployed prior to deploying its associated backup engine. This also ensures that the primary engine will assume the role of active engine and the backup engine will initially detect the presence of an operational active application engine and acquire the standby role.

In a particular embodiment of the present invention, during step 800 a deployment server initially invokes a deploy command specifying a deployment package associated with the primary application engine. In response, during step 802 information is acquired identifying the platform, files, node name and application objects associated with the primary engine. The primary engine object itself and files and information utilized by the primary engine object are thereafter transferred during step 804 to (if not already present upon) a node containing the platform that hosts the primary engine. During step 804, the primary engine object is created and launched on the node. Upon completing step 804, the primary engine's status is set to “Deployed” during step 806. At this point none of the application objects hosted by the primary engine have been deployed to the primary engine. Instead, deploying the application objects is performed in a runtime environment wherein one of the fail-over enabled application engine pair has acquired “active” runtime status.

In the exemplary embodiment, deployment is carried out sequentially by initially deploying the primary engine and then deploying the backup engine of a fail-over application engine pair. The primary and backup application engines should be deployed to distinct platforms. In the exemplary embodiment, after successfully deploying the primary engine and before deploying the backup engine, at step 808 the platforms specified for hosting the primary and backup engines are compared. If different platforms are specified, then control passes to step 810 wherein the deployment server invokes a deploy command specifying a deployment package associated with the backup application engine. Thereafter, steps 812 and 814 (which correspond to steps 802 and 804 described herein above) are carried out with regard to the backup application engine. Thereafter, at step 816 the backup engine configuration status is set to “Deployed” status. The backup application engine, like the primary application engine, does not host any application object at the time of completing step 816. Control then passes to the End.

On the other hand, if at step 808 the same platform is specified to host the primary and backup engines of a redundant pair (the equivalent of a same networked machine since a single platform is present on any machine), then control passes to step 818 wherein deploying the backup application engine is bypassed, and a partial success/failure to deploy the redundant fail-over engine configuration is registered/reported. Control then passes to the End.
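The ordering and the platform comparison of steps 800 through 818 can be condensed into the following sketch. The function name, the dictionary representation of an engine, and the returned result strings are hypothetical; file transfer and engine launch are reduced to a status change.

```python
def deploy_pair(primary, backup):
    """Deploy a fail-over pair; each engine is a dict with 'platform' and 'status'."""
    # Steps 800-806: the primary engine is always deployed first.
    primary["status"] = "Deployed"
    # Step 808: compare the platforms specified for the two engines.
    if primary["platform"] == backup["platform"]:
        # Step 818: bypass the backup and report a partial result.
        return "partial"
    # Steps 810-816: deploy the backup engine to its own platform.
    backup["status"] = "Deployed"
    return "success"


good_primary = {"platform": "Node_A", "status": "Undeployed"}
good_backup = {"platform": "Node_B", "status": "Undeployed"}
good_result = deploy_pair(good_primary, good_backup)

bad_primary = {"platform": "Node_A", "status": "Undeployed"}
bad_backup = {"platform": "Node_A", "status": "Undeployed"}
bad_result = deploy_pair(bad_primary, bad_backup)
```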

Un-deploying a fail-over pair is facilitated by an “un-deploy” command (see “undeploy” option under the Object menu in FIG. 7) supported by the IDE 126. The fail-over pair can be un-deployed by individual selection of each engine or simultaneously using the “un-deploy both” option in an Un-deploy dialog. When the “un-deploy both” option is selected, the standby engine is un-deployed first and then the active engine. When a hardware failure occurs causing a fail-over, a user typically un-deploys the fail-over enabled engines from a failed node and re-deploys the engines on a new node. The user marks the engines as un-deployed to relocate the engines to a new host platform. Marking an engine as un-deployed on failure applies to either engine in a fail-over pair.

Step 380 Establishing a Connection Between Primary and Backup Engines Via RMC

Returning to FIG. 3, after completing the deploying step 370 the application engines exist on their respective platforms in a runtime environment wherein the active and standby engines of a fail-over pair communicate with each other and monitor each other's status through an RMC. Therefore, during step 380 the primary application engine issues a request to connect to its fail-over backup engine. In an embodiment of the invention, the connection request is issued via the RMC and includes the remote partner address (RPA), corresponding to the host platform of the backup engine, in the destination field (configured on the primary engine during step 350). The source field identifies the physical address of the platform that hosts the primary engine. The initial connection request serves to inform the backup engine (or host platform of the backup engine) of the physical address for its primary engine on the RMC, and the backup engine updates its RPA attribute based upon the address specified in the source field of the connection request.
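The exchange of addresses during step 380 can be sketched as follows. The dataclass, the dictionary form of the connection request, and the sample RMC addresses are assumptions made for illustration; only the use of the RPA as destination and the backup learning its own RPA from the source field are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    rmc_address: str  # address of the engine's host platform on the RMC
    rpa: str = ""     # remote partner address; empty until known


def connect(primary, backup):
    """Primary sends a connection request; backup learns its RPA from it."""
    request = {"source": primary.rmc_address, "destination": primary.rpa}
    if request["destination"] != backup.rmc_address:
        return False  # the request was not addressed to this backup host
    # The backup sets its RPA from the source field of the request.
    backup.rpa = request["source"]
    return True


# Hypothetical addresses; the primary's RPA was configured during step 350.
primary_engine = Engine(rmc_address="192.168.1.101", rpa="192.168.1.102")
backup_engine = Engine(rmc_address="192.168.1.102")
connected = connect(primary_engine, backup_engine)
```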

Step 390 Deploying Application Objects and Related Files to Active Engine

The distinct/differing roles assigned to particular engines of a fail-over application engine pair are incorporated into a runtime environment (described herein below with reference to FIG. 9) wherein one engine of each fail-over application engine pair is assigned/acquires a role of “active engine” and the other engine is assigned/acquires a role of “standby engine”. The active engine of a fail-over pair can be either the primary or backup engine of a fail-over engine configuration. However, only one of the two application engines can be the active engine at any time.

The current runtime role (e.g., active or standby) of an application engine determines the manner in which application objects and related components (e.g., files) are provided to a platform hosting an instance of an application engine of a fail-over engine pair. During step 390 (a step that can occur at any point after the primary application engine is operational, even before step 380 wherein the RMC is established), application objects and related components are deployed from a configuration database/file repository to the active engine of a fail-over enabled pair via a primary network (e.g., network 119 in FIG. 1).

The following summarizes an exemplary sequence of steps for deploying an application object and associated/required components to a particular active application engine deployed from a fail-over enabled application engine configuration. In response to an instruction/command to deploy a specified application object to a fail-over enabled application engine pair, the status (e.g., active or standby) of both the primary and backup application engines is determined. Thereafter, a node name (or address) for a node where an active application engine (of a deployed primary and backup engine pair) resides is obtained. Next, further information is acquired relating to the node, platform, and active application engine. Furthermore, information is acquired for the specified application object and any components (e.g., files) required by the application object (that are to be deployed with the application object) on the node containing the active application engine that will host the deployed application object. Thereafter, components identified as needed to support instantiating and executing the application object on the active application engine are deployed to the node.

In a particular embodiment, deploying required components for instantiating and executing an application object on a particular active application engine is optimized to identify components (e.g., files) that are already on a target platform that hosts the active application engine. Only components that do not already exist on the target platform are transferred during step 390. Furthermore, if a particular application object is already deployed on the target host engine, then components previously loaded on the node associated with the application object (and not in use by other application objects) are undeployed from the node, and a table of deployed components (e.g., files) on the node is updated to reflect removal of the undeployed components. Thereafter, a fresh set of components associated with the application object is deployed to the node. The table of deployed components on the node is updated to include the loaded components.

After receiving the aforementioned components (e.g., files), during step 395 the active application engine deploys the application object and related components to a second node upon which a standby application engine resides. During deployment the active engine's host obtains a list of components that are needed by the backup engine to host application objects. The primary engine deploys the listed components to the standby engine via the RMC. It is noted that a platform running the standby application engine potentially hosts other application engines. Thus, the node hosting the standby engine potentially has some or all of a set of components needed to instantiate and execute the application object deployed on the active engine during step 390. Accordingly, when components are transferred over the RMC, the sender initially determines which ones of the needed components are already present on the node upon which the standby engine resides. Only the components that are not already present on the standby engine's node are transferred via the RMC during step 395.
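By way of illustration only, the "transfer only what is missing" determination described for steps 390 and 395 can be sketched as a simple set comparison. The function name and the component file names below are hypothetical and do not appear in the described system; the sketch assumes components can be compared by name alone.

```python
def components_to_transfer(required, present_on_target):
    """Return the required components that the target node lacks.

    `required` is an ordered list of component names needed by the
    application object; `present_on_target` is any iterable of names
    already installed on the target node. Order of `required` is kept.
    """
    present = set(present_on_target)
    return [name for name in required if name not in present]


# Hypothetical example: the standby node already holds one component
# because it hosts another engine that uses it.
required = ["engine_rt.dll", "valve_obj.dll", "pid_loop.dll"]
on_standby_node = {"engine_rt.dll"}
to_send = components_to_transfer(required, on_standby_node)
```

Here `to_send` would contain only the two components absent from the standby node, which is the set a sender would push over the RMC under this sketch.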

Having described configuration and deployment of a fail-over enabled host, and more particularly an application engine that hosts a set of application objects in a hierarchical application environment, attention is directed to runtime aspects of the fail-over arrangement described herein above. After deploying the application engines to their respective platforms, in a runtime environment object instances (e.g., platforms, application engines, and application objects) associated with the configured fail-over engine pair are created, initialized and launched (if appropriate) on the host machines to carry out appropriate runtime functionality associated with a current particular role (e.g., active/standby) and status (e.g., ready/not ready) of each partner of a fail-over application engine pair.

As demonstrated below, once deployed, the operation/behavior of an application object differs substantially based upon the runtime status (e.g., active or standby) of the application object's host application engine. In an exemplary embodiment, rather than operate two equivalent host (application engine) replicas, only the active engine of the fail-over application engine pair calls startup and execute methods associated with a set of application objects during runtime. The standby application engine, while having all the necessary components needed to execute the set of application objects, assumes a standby role wherein preparatory operations are performed for executing the application objects but execution of the application objects is not commenced.

The following summarizes, by way of example, the operation of an application object after being deployed to a standby engine. Upon completing step 395 the standby engine verifies that all components (e.g., code modules) required to run the deployed application object are installed on the node. Upon confirming that all components are indeed present, the deployed application object is added to a checkpoint file maintained by the standby application engine of a fail-over engine pair. In preparation for starting the application object a pre-initialization piece of an application object, referred to as a base runtime component server, is created. Primitives associated with the deployed application object are instantiated (by invoking constructors on the primitives). Initialize methods are called on each primitive.

However, methods associated with active execution of the application objects (e.g., startup, execute, scan state, handler, etc.) are not called on the primitives associated with the application object on the standby engine. Invoking such methods, associated with an actively executing application object, is postponed until a need arises for the standby engine to take on the active application engine role/status. By not starting up and executing application objects on a standby engine, workload on the node upon which the standby engine resides is substantially reduced (on a per application object basis) after completing the invoked preparatory methods. The reduced steady-state workload associated with a standby application engine facilitates having a single platform/node host multiple backup engines.

When an application engine switches from the standby role to the active role, startup methods on the primitives that make up each hosted application object are invoked. In an exemplary embodiment, a parameter is passed into the startup method informing the primitive that it is starting up in the context of a fail-over event. Next, setscanstate methods are invoked on primitives. The scan state of the object and (now active) application engine determine whether a value of true or false is passed into the setscanstate method to determine whether the primitive will be onscan (true) or offscan (false). All onscan primitives associated with the application object are periodically executed under the supervision of the host active application engine.

Conversely, when an active engine becomes a standby engine the hosted application objects revert to an inactive ready state. In particular, all application objects are set offscan. A shutdown method is invoked on each primitive associated with the application objects and execution of the application objects ceases. However, the interfaces of the primitives are not released, which facilitates fast startup in the event that the application engine re-acquires the active role of the fail-over application engine pair.
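By way of illustration only, the primitive lifecycle described above (initialize on the standby, startup and setscanstate on switching to active, offscan and shutdown on reverting to standby) can be sketched as follows. The class and function names are hypothetical. The sketch also assumes that a primitive goes onscan only when both the object and the engine are onscan; the text states that both scan states are considered but does not give the exact rule.

```python
class Primitive:
    """Hypothetical stand-in for one primitive of an application object."""

    def __init__(self, name):
        self.name = name
        self.started = False
        self.started_in_failover = False
        self.on_scan = False
        self.calls = []  # lifecycle methods invoked so far, in order

    def initialize(self):
        self.calls.append("initialize")

    def startup(self, failover):
        # `failover` tells the primitive it is starting due to a fail-over.
        self.started = True
        self.started_in_failover = failover
        self.calls.append("startup")

    def set_scan_state(self, on_scan):
        self.on_scan = on_scan
        self.calls.append("set_scan_state")

    def shutdown(self):
        # The instance is kept (interfaces not released) for fast restart.
        self.started = False
        self.calls.append("shutdown")


def prepare_on_standby(primitives):
    """Preparatory work only: no startup or execute calls on a standby."""
    for p in primitives:
        p.initialize()


def switch_to_active(primitives, object_on_scan, engine_on_scan):
    on_scan = object_on_scan and engine_on_scan  # assumed combination rule
    for p in primitives:
        p.startup(failover=True)
    for p in primitives:
        p.set_scan_state(on_scan)


def switch_to_standby(primitives):
    for p in primitives:
        p.set_scan_state(False)
    for p in primitives:
        p.shutdown()
```

Keeping the `Primitive` instances alive across `switch_to_standby` mirrors the statement that interfaces are not released, so a later `switch_to_active` need not reconstruct them.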

In an exemplary embodiment the current/next role/status of each partner engine of a fail-over pair is tracked/governed by a state machine. FIG. 9, described herein below, summarizes the fail-over states that a fail-over-enabled application can occupy and the potential transitions between the set of exemplary states. In general, the exemplary set of states can be divided into two classes: (1) “Summary” states, and (2) “Detail” states. While in Summary states, fail-over status information is provided that is used to determine the current general operational status of a particular engine. In the illustrative embodiment, the Summary states include: Determining fail-over state 900, Standby—Not Ready state 902, Standby—Ready state 904, and Active state 906. While in Detail states, relatively more detailed information (in comparison to Summary states) is provided about the operational status of a fail-over engine partner. In particular, Detail states indicate why the active or standby engine has entered a particular sub-state. In the illustrative embodiment, the detail states include: Standby—synchronizing with active 910, Standby—synchronized code 912, Standby—synchronized data 914, Standby—missed heartbeats (from active engine) 916, and Active—standby (engine) not available 918. Each of the detail states is described further herein below.

The Determining Fail-over state 900 is the initial state of the state machine of a Fail-over-enabled engine when the engine starts up. While within the Determining Fail-over state 900 the engine, having a currently undetermined status, queries a fail-over service to retrieve the fail-over status of its fail-over partner. In response, the fail-over service executes an algorithm that attempts to determine the status of the engine's fail-over partner and, ultimately, whether the engine enters the Standby—not ready state 902 or the Active state 906.

By way of example, the fail-over service determines the status of the engine's fail-over partner by first attempting to contact the Fail-over partner via the aforementioned RMC. However, if the fail-over partner engine's status cannot be obtained (via the RMC) within a configured timeout period, then the fail-over service attempts to determine the fail-over partner engine's status via the primary network. If the fail-over partner engine's status cannot be obtained (via the primary network) within a configured timeout period, then the starting engine will assume the fail-over partner engine cannot be reached. In the event that the status of the partner engine can be determined, the fail-over service executes logic resulting in one of the two engines in a fail-over pair occupying an active state and the other occupying a standby state. In addition to the status (state/sub-state) of the partner engine, such logic takes into consideration whether the partner engine is the primary or backup engine. An exemplary state selection scenario is described herein below.

If the fail-over partner engine cannot be reached to determine its status, then the engine determines whether it can become active. An engine can become active if: (1) a valid checkpoint file that represents the last known running state of an engine exists, and (2) all code modules that are needed to restore the objects from checkpoint exist on the node where the engine is running. If the engine cannot become active then the engine will continue trying to determine the status of its fail-over partner.

The engine remains in the determining fail-over state 900 until the fail-over service establishes an appropriate fail-over state, and the engine enters either the active state 906 or the standby—not ready state 902. The following summarizes the paths out of the determining fail-over state 900. If the fail-over partner engine can't be reached, and the engine can become active then the engine: restores all hosted objects from a checkpoint; schedules the hosted objects for execution; places the restored objects in their appropriate scan state as determined by checkpoint values identifying the most recent scan state of the engine; starts executing objects; and transitions to the Active—standby not available state 918.

If the fail-over partner status is known, then the next state the engine enters depends on the fail-over status of the partner. The state machine transitions from the determining fail-over state 900 to the Active state 906 if the fail-over status of the partner is either: Standby—not ready state 902, Standby—synchronizing with active state 910, or Standby—ready state 904. On the other hand, the state machine transitions from the Determining state 900 to the Standby—not ready state 902 if the fail-over status of the partner is either: Active—standby not available state 918, Active state 906, or Standby—missed heartbeats state 916. If the fail-over status of the partner engine is the Determining Fail-over state 900, then the fail-over service will direct its engine to transition from the Determining state 900 to the Active state 906 if the partner engine is configured as the backup engine of the fail-over pair. If the partner engine is the primary engine, then the engine's state machine enters the Standby—not ready state 902.
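By way of illustration only, the paths out of the Determining fail-over state 900 described above can be sketched as a single selection function. The enum and function names are hypothetical; `None` is used here to represent an unreachable partner. The text does not say what happens when the partner reports states 912 or 914, so the sketch assumes the engine simply remains in state 900 and retries.

```python
from enum import Enum


class FailoverState(Enum):
    DETERMINING = 900
    STANDBY_NOT_READY = 902
    STANDBY_READY = 904
    ACTIVE = 906
    STANDBY_SYNCHRONIZING = 910
    STANDBY_SYNCED_CODE = 912
    STANDBY_SYNCED_DATA = 914
    STANDBY_MISSED_HEARTBEATS = 916
    ACTIVE_STANDBY_NOT_AVAILABLE = 918


S = FailoverState


def next_state(partner_state, partner_is_backup, can_become_active):
    """Select the state entered from DETERMINING (900).

    partner_state: a FailoverState, or None if the partner is unreachable
        over both the RMC and the primary network.
    can_become_active: True if a valid checkpoint file and all needed
        code modules exist on this node.
    """
    if partner_state is None:
        return S.ACTIVE_STANDBY_NOT_AVAILABLE if can_become_active else S.DETERMINING
    if partner_state in (S.STANDBY_NOT_READY, S.STANDBY_SYNCHRONIZING, S.STANDBY_READY):
        return S.ACTIVE
    if partner_state in (S.ACTIVE_STANDBY_NOT_AVAILABLE, S.ACTIVE, S.STANDBY_MISSED_HEARTBEATS):
        return S.STANDBY_NOT_READY
    if partner_state is S.DETERMINING:
        # Tie-break on configuration: the primary engine becomes active.
        return S.ACTIVE if partner_is_backup else S.STANDBY_NOT_READY
    # Partner states 912 and 914 are not covered by the text (assumption:
    # stay in DETERMINING and query again).
    return S.DETERMINING
```

The tie-break branch reflects the case in which both partners start simultaneously: each sees the other in state 900, and the configured primary/backup designation decides which one becomes active.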

With regard to the Active state 906, the fail-over engine state machine transitions from the Standby—ready state 904 to the Active state 906 when a fail-over on the active partner engine has been detected. While within the Active state 906, the engine schedules hosted application objects, and passes synchronization updates, including checkpoint data and subscriber list updates, to the standby engine via the RMC. The engine state machine transitions from the Active state 906 to the Standby—not ready state 902 if commanded to become a standby engine. Alternatively, the engine state machine transitions to the Active—standby not available state 918 if the engine cannot contact or loses contact with the partner engine.

With regard to the Standby—ready state 904, a standby engine enters the Standby—ready state 904 after transitioning from the Standby—not ready state 902 through a set of intermediate synchronization states/stages 910, 912 and 914 (described herein below) wherein the code and data have been synchronized with the active partner engine. While within the Standby—ready state 904, the application engine performs a set of tasks differing from the tasks executed by an active application engine.

By way of example, while in the Standby—ready state 904, the application engine monitors the active partner engine for failure (e.g., verifying receipt of heartbeats from the active engine over both a primary network and over the RMC within a configured timeout period). Furthermore, the standby engine seeks to maintain certain information in synch with that of its active partner through incremental updates while within the Standby—ready state 904. However, in some cases, rather than merely perform an incremental update, the fail-over pair execute a complete re-synchronization of their information. In such case, the standby engine transitions from the Standby—ready state 904 to the Standby—synchronizing with active state 910 when the standby engine is notified that its information (updated via the RMC) is out of synch with its active partner. The standby engine receives, through the RMC, synchronization information from the active engine. The synchronization information includes checkpoint deltas/changes from the active engine. The checkpoint deltas are changes to checkpoint attribute values, associated with application objects hosted by the active engine, during a scan. Examples of checkpointed data include: configuration and tuning information relating to application objects, alarm limits, and the set of application objects deployed on the engine (including any needed code/data files used by the application objects). The standby engine also determines whether checkpoint deltas from the active engine have been missed and ensures that it has a consistent checkpoint snapshot. In addition to the above-noted checkpoint deltas, the standby engine potentially receives from the active engine via the RMC other synchronization information including: notifications when a client engine subscribes/un-subscribes to/from the active engine, alarm state changes (time stamped), and history blocks placed in a store-forward memory of the active engine.

One contemplated use of a fail-over application engine configuration involves providing fail-over functionality for a data acquisition service that transfers data to a networked process management information database. In the case where an application engine is configured to manage store-forward operations for a data acquisition server, configuring a fail-over store-forward engine arrangement and maintaining a copy of the active engine's store-forward memory limits the loss of data waiting to be transferred from the active engine's store-forward memory to a history database when fail-over occurs. Furthermore, if fail-over occurs while the active engine is in store-forward mode, then the standby engine takes over and continues in the store-forward mode until an intended destination of the store-forward data (e.g., a process information database) becomes available. When the destination database becomes available, the store-forward data acquired by the failed engine as well as the store-forward data subsequently acquired by the currently active (previously standby) engine are forwarded to the database.

The following summarizes the behavior of active and standby application engines including store-forward functionality. The store-forward functionality facilitates storing historical process and manufacturing information when a data path from the active engine to a historical database server is obstructed/interrupted. Historical data is processed the same on a fail-over enabled engine as on a non fail-over enabled engine when no failure is detected. Historical data is sent to the historical database server only from the active engine. The active engine processes historical data and sends it to the historical database server when the database server is available. If the historical database becomes unavailable (or a transmit data buffer becomes backed up due to a slow link), then the active engine stores the historical data locally and forwards the data when the historian becomes available. It is noted that, in an illustrative embodiment, loss of connectivity to the historical database does not initiate a fail-over. If an active engine loses connectivity to the historian and its standby engine can connect to the historian, then the active engine enters the store-forward mode, commences sending store-forward updates via the RMC, and does not fail over.

When an active application engine enters a store-forward mode of operation, the active engine synchronizes its store-forward data with its partner standby engine. The standby engine receives all of its store-forward data from its active engine. Thus upon notification of being started in a standby mode, the standby engine checks to see if it has data within its store-forward memory. If such data is present, it is purged and the standby engine waits for store-forward data from its active partner engine during an initial data synchronization stage.

In an embodiment of the invention, store-forward information synchronization is executed between active and standby engines according to a configurable repetition period. By way of example, store-forward data is written to memory in the active engine every 30 seconds. Synchronizing store-forward memory between active/standby engines also takes place every 30 seconds. Under this update scheme no more than 30 seconds of store-forward information from a previously active engine is lost during engine fail-over.

In the event of fail-over the data acquisition service hosted by the standby engine is activated and takes the place of the data acquisition service hosted by the formerly active engine. If the data acquisition service's previously active engine was in store-forward mode then the newly active engine will be capable of continuing store-forward functionality without connecting to the historian. When connectivity to the historical database is restored, identical store-forward data collected by either engine of a fail-over pair is forwarded to the database from the currently active engine.

To facilitate management of store-forward data collected across multiple failures, and to improve diagnostics, the application engine status information includes attributes summarizing a current store-forward status of the engine. By way of example, the attributes specify values indicating: store-forward data has been collected for the engine, store-forward data is currently synchronized with the standby engine, store-forward data is not synchronized with the standby engine, and the time span of the store-forward data (identified by a start time and end time).

Resuming the description of the tasks performed by the engine while in the Standby—ready state, the standby engine also verifies that it is synchronized with the active engine. A standby engine is synchronized with its corresponding active engine if: (1) files installed on the active engine's node (specified through a deployment operation) are installed on the standby node; (2) all checkpoints that exist in the active engine's checkpoint file also exist in the standby engine's checkpoint file; and (3) the standby engine has verified that it has not missed any delta checkpoints, alarm state changes, or history blocks. In an illustrative embodiment, only files installed on an active node as a result of a deployment operation to that node are considered by the standby when it verifies synchronization of files. Files installed outside a deployment operation are not considered.
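By way of illustration only, the three synchronization conditions enumerated above can be sketched as a single predicate. The function and parameter names are hypothetical, and the sketch assumes that files and checkpoints can be compared by identifier and that missed updates are reported to the check as a simple count; the text does not specify how missed deltas are detected.

```python
def is_synchronized(active_deployed_files, standby_files,
                    active_checkpoints, standby_checkpoints,
                    missed_update_count):
    """Return True if the standby meets all three stated conditions.

    active_deployed_files: files installed on the active node by a
        deployment operation (files installed otherwise are excluded
        by the caller, per the illustrative embodiment).
    missed_update_count: number of delta checkpoints, alarm state
        changes, or history blocks the standby knows it has missed.
    """
    files_ok = set(active_deployed_files) <= set(standby_files)
    checkpoints_ok = set(active_checkpoints) <= set(standby_checkpoints)
    return files_ok and checkpoints_ok and missed_update_count == 0
```

Subset comparison is used for conditions (1) and (2) because the text only requires that everything on the active side also exist on the standby side while in the Standby—ready state.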

Multiple exit paths exist from the Standby—ready state 904. The application engine state machine transitions to the Active state 906, described herein above, in response to receiving a command to become active. Alternatively, the state machine enters the Standby—synchronizing with active state 910 in response to receiving notification that it is no longer synchronized with the active engine. Still another transition path brings the state machine to the Standby—missed heartbeats state 916 when a configurable set of heartbeats have been missed from the active engine.

With regard to the Standby—not ready state 902, an engine enters the Standby—not ready state 902 from any one of multiple states. The state machine transitions to the Standby—not ready state 902 if the standby engine has determined it has missed checkpoints and/or alarm state changes from the active engine while at the Standby—synchronized data state 914. Such synchronization failures are typically caused by communication failures in the RMC. However, other sources of such failures include checkpoints, alarm state changes, and history blocks being sent faster than the standby engine can process them. Such failures can be avoided by adding/increasing the capacity of buffers for the data transferred via the RMC.

The state machine also transitions to the Standby—not ready state 902 when new objects are deployed to the active engine. The deployment of new objects to an engine in the Active state 906 causes the creation of checkpoints on the active engine and the installation of code modules required by the deployed objects. If the state machine is in the Standby—ready state 904 at the time new files need to be installed on the standby engine, then the state machine transitions to the Standby—not ready state 902 (or, if the active engine has already been detected, directly to the Standby—synchronizing with active state 910). The state machine also enters the Standby—not ready state 902 from either the Standby—synchronizing with active state 910, Standby—synchronized code state 912 or the Standby—synchronized data state 914 if the standby engine detects that communications with the active engine via the RMC are lost before the standby engine completes synchronization and enters the Standby—ready state 904.

While within the Standby—not ready state 902, the standby engine attempts to perform tasks needed to ultimately transition to the Standby—ready state 904 by synchronizing code modules and data with the active engine while successfully progressing through states 910, 912 and 914. In the illustrative embodiment of the present invention, the progression begins with a transition from the Standby—not ready state 902 to the Standby—synchronizing with active state 910 after establishing communications with the active engine via the RMC.

With regard to the Active—standby not available state 918, the application engine state machine transitions into the Active—standby not available state 918 from either an active or a standby state. The state machine transitions from the Active state 906 to the Active—standby not available state 918 if a communication failure with the standby engine, via the RMC, is sensed when transmitting the following synchronization information: checkpoint deltas, subscription notifications, or alarm state changes. A failure to transmit a store-forward history block to the standby engine will not cause a transition to the standby not available state 918 from the Active state 906.

The active engine periodically receives heartbeats from its corresponding standby engine. If a (configurable) time period for receiving a heartbeat from a standby engine expires, then the active engine state machine transitions from the Active state 906 into the Active—standby not available state 918. Furthermore, in an embodiment of the invention, the heartbeat is an indicator of a healthy platform/node, and therefore multiple heartbeats will not be sent from a platform hosting multiple standby engines to a node hosting corresponding active engines. Instead, one heartbeat message is sent from a platform hosting the multiple standby engines to the platform hosting the corresponding active engines. The repetition period of heartbeats, sent from a node Y having standby engines, to a node X with active engines is the smallest configured timeout for all active engines deployed to node X that have a standby engine deployed to node Y. Alternatively, where a heartbeat is intended to indicate the health of each engine, separate heartbeats are issued for each fail-over engine. In such instances multiple heartbeats are issued between a first platform hosting multiple standby engines and a second platform hosting corresponding active engines.
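By way of illustration only, selecting the single per-node heartbeat period as the smallest configured timeout among the engine pairs shared by two nodes can be sketched as follows. The function name and the dictionary keys (`active_node`, `standby_node`, `timeout_s`) are hypothetical configuration fields assumed for this sketch.

```python
def heartbeat_period(engine_pairs, active_node, standby_node):
    """Return the smallest configured timeout, in seconds, among engine
    pairs whose active engine is on `active_node` and whose standby
    engine is on `standby_node`; None if the nodes share no pair."""
    timeouts = [
        pair["timeout_s"]
        for pair in engine_pairs
        if pair["active_node"] == active_node
        and pair["standby_node"] == standby_node
    ]
    return min(timeouts) if timeouts else None


# Hypothetical configuration: two pairs span nodes X and Y, one spans X and Z.
pairs = [
    {"active_node": "X", "standby_node": "Y", "timeout_s": 5.0},
    {"active_node": "X", "standby_node": "Y", "timeout_s": 2.0},
    {"active_node": "X", "standby_node": "Z", "timeout_s": 1.0},
]
period_y_to_x = heartbeat_period(pairs, "X", "Y")
```

Using the minimum means the one shared heartbeat is frequent enough to satisfy the most demanding engine pair on the two nodes.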

The application engine state machine transitions into the Active—standby not available state 918 from the Active state 906 if the active engine receives notification, via the RMC, that the standby engine is unavailable. Examples of when such transitions occur include when the standby engine has been shut down and is therefore no longer running.

A standby engine's state machine transitions from the Standby—missed heartbeats state 916 to the Active—standby not available state 918 if the standby engine has missed a configurable number of consecutive heartbeats from the active engine via the RMC (causing an initial transition of the standby engine's state machine from the ready state 904 to the missed heartbeats state 916), and an independent monitor issues a command to the standby engine to become active. Monitoring for failures of an active engine is discussed further herein below.

While in the Active—standby not available state 918 an active application engine hosts execution of application objects that are deployed on scan to the application engine. The active application engine periodically checks to see if the standby engine can be contacted via the RMC. Because there is no current standby engine, the active application engine cannot be manually switched to standby. Furthermore, the active application engine will not attempt to send the checkpoint deltas (changes), subscription notifications, alarm state changes, and store-forward history data blocks that are typically passed, via the RMC, to the standby engine.

The state machine transitions out of the Active—standby not available state 918, and into the Active state 906, if a connection is re-established with an operational corresponding standby engine.

With regard to the Standby—missed heartbeats state 916, a standby engine transitions from the Standby—ready state 904 into the Standby—missed heartbeats state 916 if a heartbeat has not been received, via the RMC or primary network, by the standby engine from the active partner's fail-over service within a configured timeout period (determined, for example, by a heartbeat time out limit parameter value and consecutive missed heartbeats parameter value). Consistent with the arrangement for sending heartbeats from a standby node to an active node, a single heartbeat is sent from a node hosting multiple active engines to another node hosting their corresponding standby engines. The repetition period of heartbeats, sent from the active engine's fail-over service on active node X, to a standby node Y is the smallest configured timeout for all active engines deployed to node X that have a standby engine deployed to node Y. Other potential events causing a transition to the Standby—missed heartbeats state 916 include: the active engine failing or hanging (determined by the active engine's fail-over service through a separate timeout mechanism; see active engine timeout 1140 described herein below); and the active engine shutting down gracefully. In the latter instance, the standby engine will be notified that it is to transition to the Active—standby not available state 918.
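By way of illustration only, a detector that combines a heartbeat timeout limit with a consecutive missed heartbeats threshold can be sketched as follows. The class and attribute names are hypothetical. The sketch assumes one heartbeat is "missed" for each full timeout interval that elapses without a heartbeat; the text names the two parameters but does not state how they are combined.

```python
class HeartbeatMonitor:
    """Hypothetical per-channel monitor (one for the RMC, one for the
    primary network). Times are plain numbers of seconds supplied by
    the caller, so the logic is independent of any clock source."""

    def __init__(self, timeout_s, max_consecutive_missed, now):
        self.timeout_s = timeout_s
        self.max_missed = max_consecutive_missed
        self.last_seen = now

    def record_heartbeat(self, now):
        # Any received heartbeat resets the consecutive-miss count.
        self.last_seen = now

    def missed(self, now):
        """Number of whole timeout intervals elapsed since the last heartbeat."""
        return int((now - self.last_seen) // self.timeout_s)

    def threshold_reached(self, now):
        return self.missed(now) >= self.max_missed
```

Under this sketch, a standby engine in the Standby—ready state 904 would enter the Standby—missed heartbeats state 916 once `threshold_reached` returns True for a monitored channel.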

While within the Standby—missed heartbeats state 916 logic is performed to determine why the standby engine missed the heartbeats and whether the state machine for the standby engine will transition to an active mode of operation or remain in the standby mode (transitioning either to the Standby—ready state 904 or the Standby—not ready state 902). Referring to FIG. 10, during step 1000 a fail-over service for the standby engine checks for/monitors heartbeats from the active engine through both the primary network and the RMC (e.g., network 119 and link 140 of FIG. 1). At step 1002, if a currently configured number of consecutive heartbeats, sent via the RMC, have been missed, then control passes to step 1004. At step 1004 the fail-over service determines whether the active engine's node can be reached via the primary network. If the active engine's node can be reached via the primary network, then the RMC link is assumed to be down and control passes to step 1006 wherein the engine's state machine enters the Standby—not ready state 902.

Otherwise, if at step 1004 the active engine cannot be reached, then control passes to step 1007. At step 1007, if the standby engine's node cannot reach any other node via the primary network, then a communication problem probably exists in the host of the standby engine and control passes to step 1006 and the state machine enters the Standby—not ready state 902. If at least one other node can be reached via the primary network, then further tests are performed to determine whether the current active engine has failed and thus control passes to step 1008. At step 1008, if another platform node can access the active engine's node, then the active engine is assumed to still be available (and the problem lies with the standby engine's node), and control passes to step 1006. Otherwise, if at step 1008 none of the nodes can see the active engine's node, then the malfunction likely originates from the active engine's node. Control therefore passes to step 1010 wherein the standby engine enters the active engine mode. Because the fail-over partner is assumed to be out of service, during step 1010 the state machine transitions from the Standby—missed heartbeats state 916 to the Active—standby not available state 918.

Returning to step 1002, if the currently configured number of consecutive heartbeats sent via the RMC have not been missed, then control passes to step 1020 wherein the fail-over service checks whether heartbeats sent via the primary network have been missed. If a configurable number of heartbeats have not been missed, then control passes to step 1006 and the standby engine enters the Standby—not ready state (since there is apparently a problem with the RMC connection supporting communications between the active and standby engines).

However, if consecutive heartbeats have been missed via the primary network then control passes from step 1020 to step 1022. At step 1022 connectivity tests are performed to determine whether the active and standby engines can reach at least one other platform via the primary network. Thereafter, at step 1024 if at least one platform can be reached by the active engine's node via the primary network, then control passes to step 1026. At step 1026, if the standby node can reach at least one other platform via the primary network, then it is assumed that a connectivity problem exists, on the primary network, between the nodes hosting the active and standby engines. Therefore control passes from step 1026 to step 1028 and the standby engine's state machine enters the Standby—ready state 904. Otherwise a connectivity problem apparently exists between the node hosting the standby engine and all other nodes, control passes from step 1026 to step 1006, and the state machine transitions from the Standby—missed heartbeats state 916 to the Standby—not ready state 902.

Returning to step 1024, if the active engine's node cannot reach any other node on the primary network, then control passes to step 1030. At step 1030, if at least one node can be reached via the primary network from the standby node, then the active engine's primary network adapter has apparently failed and the standby should take over for the failing active engine in servicing requests from clients of the application engine. Therefore, control passes from step 1030 to step 1032. At step 1032 the current active engine is directed to enter a standby mode and the standby engine is commanded to enter an active mode. Control then passes from step 1032 to step 1010. Otherwise, if not even one node can be reached by the standby node via the primary network, then control passes from step 1030 to step 1006.
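By way of illustration only, the decision flow of FIG. 10 as described above (steps 1002 through 1032) can be sketched as one function over a set of boolean observations. All names are hypothetical. The sketch assumes each connectivity test has already been performed and reduced to a flag, and it follows the textual description of each step literally rather than any inferred intent.

```python
from dataclasses import dataclass

# Resulting states, identified by their reference numerals.
STANDBY_NOT_READY = 902
STANDBY_READY = 904
ACTIVE_STANDBY_NOT_AVAILABLE = 918


@dataclass
class Observations:
    rmc_heartbeats_missed: bool        # step 1002
    primary_heartbeats_missed: bool    # step 1020
    active_node_reachable: bool        # step 1004, from standby via primary network
    standby_reaches_other_node: bool   # steps 1007, 1026, 1030
    other_node_reaches_active: bool    # step 1008
    active_reaches_other_node: bool    # step 1024


def resolve_missed_heartbeats(obs):
    """Return the state entered from Standby-missed heartbeats (916)."""
    if obs.rmc_heartbeats_missed:                      # step 1002
        if obs.active_node_reachable:                  # step 1004: RMC assumed down
            return STANDBY_NOT_READY                   # step 1006
        if not obs.standby_reaches_other_node:         # step 1007: local problem
            return STANDBY_NOT_READY
        if obs.other_node_reaches_active:              # step 1008: active still up
            return STANDBY_NOT_READY
        return ACTIVE_STANDBY_NOT_AVAILABLE            # step 1010
    if not obs.primary_heartbeats_missed:              # step 1020
        return STANDBY_NOT_READY                       # step 1006
    if obs.active_reaches_other_node:                  # step 1024
        if obs.standby_reaches_other_node:             # step 1026
            return STANDBY_READY                       # step 1028
        return STANDBY_NOT_READY                       # step 1006
    if obs.standby_reaches_other_node:                 # step 1030
        return ACTIVE_STANDBY_NOT_AVAILABLE            # steps 1032, 1010
    return STANDBY_NOT_READY                           # step 1006
```

In this sketch the standby takes over only on the two paths that reach step 1010; every other path leaves it in a standby state, which matches the conservative bias of the described flow against two simultaneously active engines.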

Returning to FIG. 9, a series of states are associated with synchronizing a standby engine and its corresponding active partner engine via the RMC. The Standby—synchronizing with active state 910 is entered from the Standby—not ready state 902 when the active engine is detected by the host of the standby engine via the RMC. As noted previously above, a backup/standby engine does not receive code modules for supporting application objects via the primary network, and instead receives such code modules from the primary/active engine via the RMC. While within the Standby—synchronizing with active state 910, the standby application engine synchronizes its code modules with the active engine. Therefore, any code modules on the standby engine that do not exist on the active engine are uninstalled, and any code modules on the active engine that are not installed on the standby engine are installed on the standby engine's node. Once the code modules are synchronized, the state machine transitions to the Standby—synchronized code state 912.

While within the Standby—synchronized code state 912, the standby engine synchronizes its checkpoint data and other snapshot information, including subscriber information, with the active engine. The synchronization comprises: deleting checkpoint data (including object information) or subscriber information in the standby engine's records that does not exist on the active engine; and adding checkpoint data (including object information) or subscriber information to the standby engine's records that exists on the active engine but not on the standby engine. If communication is lost over the RMC while the state machine is in the Standby—synchronized code state 912, then the state machine transitions to the Standby—not ready state 902. However, upon successfully completing synchronizing the object information, checkpoint data, and subscriber information the state machine transitions to the Standby—synchronized data state 914. While operating within the Standby—synchronized data state 914, a standby application engine completes its data synchronization processing (e.g. updating databases and directories in view of the transferred synchronization information) and transitions to the Standby—ready state 904. However, if communication is lost between the primary and standby engines while the state machine is in the Standby—synchronized data state 914, then the state machine transitions to the Standby—not ready state 902.
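The synchronization states and the transitions stated for them can be summarized as a small transition table. The Python sketch below is an illustrative reading only: the event names are assumptions, and leaving the state unchanged for any state/event pair not described in the text is likewise an assumption rather than disclosed behavior.

```python
from enum import Enum


class StandbyState(Enum):
    # Values are the reference numerals used for the states in FIG. 9.
    NOT_READY = 902
    READY = 904
    SYNCHRONIZING_WITH_ACTIVE = 910
    SYNCHRONIZED_CODE = 912
    SYNCHRONIZED_DATA = 914


class Event(Enum):
    # Hypothetical event names introduced for this sketch.
    ACTIVE_DETECTED_VIA_RMC = "active detected via RMC"
    STAGE_COMPLETE = "stage complete"
    COMMUNICATION_LOST = "communication lost"


S, E = StandbyState, Event

TRANSITIONS = {
    (S.NOT_READY, E.ACTIVE_DETECTED_VIA_RMC): S.SYNCHRONIZING_WITH_ACTIVE,
    (S.SYNCHRONIZING_WITH_ACTIVE, E.STAGE_COMPLETE): S.SYNCHRONIZED_CODE,
    (S.SYNCHRONIZED_CODE, E.STAGE_COMPLETE): S.SYNCHRONIZED_DATA,
    (S.SYNCHRONIZED_CODE, E.COMMUNICATION_LOST): S.NOT_READY,
    (S.SYNCHRONIZED_DATA, E.STAGE_COMPLETE): S.READY,
    (S.SYNCHRONIZED_DATA, E.COMMUNICATION_LOST): S.NOT_READY,
}


def next_state(state: StandbyState, event: Event) -> StandbyState:
    """Look up the next state; undescribed pairs keep the current state."""
    return TRANSITIONS.get((state, event), state)
```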

While operating in a fail-over mode, the active and standby engines maintain awareness of one another's status through alarms. A summary is provided herein below of the various alarm states and their role in governing the transitions and operation of the state machines.

Below is a summary of the various alarms associated with fail-over that are reported when standby and active engines transition between the previously described fail-over states. The description of each reported alarm contains: the engine fail-over partner's node name, the summary state and detail state if applicable, the node name of the engine reporting the alarm, and the name of the engine reporting the alarm. To simplify Table 1, only summary states are identified. Any transition from a previous state to a current state or sub-state of the current state will cause this alarm to occur.

TABLE 1

Alarm Name | Previous State | Current State | Alarm cleared when enter | Alarm reported by
Standby not Ready | Active | Standby - Not ready | Standby - ready | Active engine
Standby not available | Active | Active - standby not available | Active | Active engine

With regard to the Standby not ready alarm, the active engine monitors the status of the standby engine, via the RMC, to determine when to raise the alarms mentioned in Table 1. Furthermore, if the active engine is in the standby unavailable state this alarm will not be generated.

Table 2 summarizes the alarms reported whenever a fail-over occurs.

TABLE 2

Alarm Name | Alarm raised when | Alarm cleared when | Alarm reported by
Fail-over occurred | When standby becomes active | During the next scan of the active engine | Active engine
Standby history data out of sync | When Active engine fails to update standby with history blocks | When the history data in active and standby engines are in sync | Active engine
Standby alarm data out of sync | When Active engine fails to update standby with alarm data | When the alarm data in active and standby engines are in sync | Active engine

In addition to the above alarms, the consecutive heartbeats missed over the RMC and consecutive heartbeats missed over the primary network will be provided as attributes that can be extended by the user to report alarms if desired.

Turning briefly to FIG. 11, a set of timers/limits is identified that are associated with the fail-over engines. These timers are utilized to ensure proper tracking of the health of the fail-over engine pair and the networks and hosts through which they communicate. A primary network communication timeout 1100 is used, by way of example, by an engine while in the determining fail-over state 900 when the engine attempts to determine the state of its fail-over partner via the primary network (e.g., network 119). The primary network timeout is independently configurable, and exists as an attribute which can be modified at configuration time and runtime.

A standby engine heartbeat timeout 1110 is used, by way of example, by the active engine while in the active state 906 to determine whether the active engine has lost communication with the standby engine via the RMC. The heartbeat timeout is configurable both at runtime and configuration time from the active engine, is deployed over to the active engine, persists across engine restarts, and is assigned a default of 2 seconds.

An active engine heartbeat timeout 1120 is used, by way of example, by the standby engine while in the standby—ready state 904 to determine whether the standby engine has missed heartbeats from its fail-over partner via the RMC. A missed heartbeat is registered (and the standby engine transitions to the standby—missed heartbeats state 916) if the standby engine has not received, via the RMC, a heartbeat from its active engine partner within the time period specified by the active engine heartbeat timeout. The active engine heartbeat timeout is configurable at configuration time and runtime from the active engine, persists across engine restarts, and is assigned a default of 5 seconds.

A consecutive heartbeats missed limit 1130 specifies the consecutive number of heartbeats missed between the active and standby engines via the primary network or RMC (utilized during the Standby—missed heartbeats state 916). The consecutive heartbeats missed limit is configurable from the active engine at configuration time and runtime, persists across engine restarts, and has a default value of 2. The default value of 2 implies that 2 heartbeats must be missed in a row in order for the consecutive number of heartbeats missed condition to become true and cause a fail-over. Missing a single heartbeat brings the standby engine's state machine into the Standby-missed heartbeats state 916.
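The relationship between a single missed heartbeat and the consecutive heartbeats missed limit can be illustrated with a simple counter. The following Python sketch is a hypothetical illustration; the class and method names are assumptions, and only the default limit of 2 is taken from the description above.

```python
class MissedHeartbeatCounter:
    """Hypothetical counter for consecutive missed heartbeats."""

    def __init__(self, limit: int = 2) -> None:
        # The default of 2 matches the default stated for limit 1130.
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.consecutive_missed = 0

    def heartbeat_received(self) -> None:
        # Any received heartbeat breaks the run of consecutive misses.
        self.consecutive_missed = 0

    def heartbeat_missed(self) -> bool:
        """Record a miss; return True once the limit has been reached."""
        self.consecutive_missed += 1
        return self.consecutive_missed >= self.limit
```

With the default limit, a first miss returns False (the missed heartbeats state is entered but the limit condition is not yet met), and a second miss in a row returns True.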

An active engine timeout limit 1140 specifies a timeout period within which an active engine must notify its fail-over service, running on the same platform as the active engine, that it is still functional. If the timeout period is exceeded, the system will determine that the active engine has failed or hung and initiate a fail-over sequence wherein a standby partner of a fail-over configured engine is commanded to become active, and clients/subscribers are informed of fail-over related events. The active engine timeout limit is configurable during configuration and runtime, persists across engine restarts, and a default value is specified in a primitive specification.

A subscribed engine node connection timeout 1150 specifies a period utilized by the standby—missed heartbeats fault resolution scheme (see, FIG. 10) to wait for a response from nodes that have engines subscribed to the active engine to determine whether they can see the active engine. The subscribed engine node connection timeout is configurable at configuration and runtime, persists across engine restarts, and a default value is specified in a primitive specification.

Detecting Active Engine Failures Affecting Clients/Subscribers

Another aspect of the runtime operation of a fail-over host pair is reliably detecting an active host malfunction and ensuring that clients/subscribers timely re-connect to a (previously) standby host when fail-over occurs. A monitoring scheme is described herein below that reduces communications load associated with monitoring the operational status of an active host while maintaining a high degree of confidence that when an active host ceases to function, the failure is detected and clients of the failed active host quickly reconnect to the (previously) standby host of a fail-over pair.

A first aspect of sensing engine/host failures involves detecting failure of a node upon which an active host currently resides. One way to monitor the status of a node is the use of heartbeats. However, heartbeats consume network resources and tie up computing resources. Therefore, in an exemplary embodiment, heartbeats associated with node status are limited with regard to their intended recipients. Heartbeats are not sent by a publisher (e.g. an application engine) to clients/subscribers. For example, heartbeats are not sent to a plant floor visualization application instance that subscribes to a tag on an application object hosted by an engine. In cases where a client/subscriber and a publisher are on differing nodes, heartbeats are sent between nodes (platforms) hosting publishers (engines) and related clients/subscribers. When heartbeats are expected by a node and they are not received within a configured time period, then a monitoring mechanism assumes that: the node or its network adapter has failed, and the network path between the two nodes has failed. The rate at which heartbeats are sent between two nodes: is configurable on a platform both at runtime and configuration time (limited at runtime to users with tuning permissions); persists across platform restarts; is a minimum of 250 milliseconds; and defaults to 2500 milliseconds.

An error is sensed when a configurable number of consecutive heartbeats are missed. The number of consecutive heartbeats missed by a node hosting an active engine of interest: is configurable on a platform both at runtime and configuration time (limited at runtime to users with tuning permissions); persists across platform restarts; and will default to two. If a configured number of consecutive heartbeats from a node is missed, then a failure of the node, from which the heartbeats were expected, is assumed, and all clients that expect data from this failed node are notified of the assumed failure by monitoring services residing on their host nodes.
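The node-level heartbeat settings described above (a heartbeat period of at least 250 milliseconds defaulting to 2500 milliseconds, and a consecutive missed limit defaulting to two) can be captured in a small validated settings object. The Python sketch below is illustrative only: the class, field, and method names are assumptions, and the detection window estimate further assumes that a heartbeat counts as missed once one full period has elapsed, which the description does not state.

```python
from dataclasses import dataclass

MIN_HEARTBEAT_PERIOD_MS = 250        # stated minimum
DEFAULT_HEARTBEAT_PERIOD_MS = 2500   # stated default
DEFAULT_CONSECUTIVE_MISSED_LIMIT = 2 # stated default


@dataclass(frozen=True)
class NodeHeartbeatSettings:
    """Hypothetical container for per-platform heartbeat settings."""
    period_ms: int = DEFAULT_HEARTBEAT_PERIOD_MS
    consecutive_missed_limit: int = DEFAULT_CONSECUTIVE_MISSED_LIMIT

    def __post_init__(self) -> None:
        if self.period_ms < MIN_HEARTBEAT_PERIOD_MS:
            raise ValueError("heartbeat period is below the 250 ms minimum")
        if self.consecutive_missed_limit < 1:
            raise ValueError("consecutive missed limit must be at least 1")

    def approximate_detection_window_ms(self) -> int:
        # Assumption: one miss is registered per elapsed period, so the
        # limit is reached after roughly period * limit milliseconds.
        return self.period_ms * self.consecutive_missed_limit
```

Under these assumptions the default settings give an approximate detection window of 5000 milliseconds, and shortening the period trades additional network traffic for faster detection.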

A second aspect of sensing engine/host failures involves detecting a failure of the engine itself (without its host node/platform going down). In contrast to using heartbeats, a separate monitoring process determines, and informs clients, that a particular engine is no longer available for a broad variety of circumstances. Examples of such circumstances include when the application engine has been shut down, failed (e.g., crashed unexpectedly), or hung (i.e., though still operating, is not receiving/responding to messages passed to it by the platform upon which it resides).

Referring to FIG. 12, a sequence of steps summarizes a progression of stages associated with monitoring for and responding to an active application engine failure by notifying a messaging infrastructure serving clients/subscribers (e.g., WONDERWARE's INTOUCH human machine interface) to the engine located on other network nodes so that the messaging infrastructure for the clients/subscribers can take steps to update data connections to reference the (previous) standby partner of a fail-over engine configuration. In summary, rather than rely upon transmitting a periodic heartbeat to each client, a separate process, executing upon the same machine as the active engine, monitors the active engine's health. The monitoring process notifies a fail-over service of the standby engine when a failure of the active engine is detected. Thereafter, the fail-over service informs the messaging infrastructure serving clients/subscribers to the failed-over engine of the new active engine's status.

During stage 1200, a separately executing monitoring process (e.g., the bootstrap process on the computing system upon which an application engine is running) monitors the health of the active application engine. The monitoring process receives periodic notifications from the active application engine according to a time interval. The interval is individually configurable for each engine both at runtime and configuration time. However, runtime configuration will be limited to users having tune permissions, and the interval persists across engine restarts. Monitoring the health of an engine by a process operating on the same node reduces network workload in comparison to a scheme where clients are individually informed of an engine's health via heartbeats.

During stage 1210 the monitoring process detects that the active application engine has shut down, crashed, or hung. In response, at stage 1220 the monitoring process initiates notifying the standby engine that the previously active engine is not operational. By way of example, the monitoring process notifies a fail-over service on its own machine that, in turn, notifies (via the RMC) a fail-over service on the same platform as the standby engine, that the standby engine is to become active.
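The local monitoring arrangement of stages 1200 through 1220 resembles a watchdog: the engine periodically reports in, and the monitor raises a failure once the configured interval elapses without a report. The following Python sketch is a simplified illustration only. The class and method names are assumptions, timestamps are passed in explicitly rather than read from a clock, and the callback stands in for the notification to the local fail-over service.

```python
from typing import Callable, Optional


class EngineWatchdog:
    """Hypothetical same-node monitor for an active engine."""

    def __init__(self, timeout_s: float,
                 on_failure: Callable[[str], None]) -> None:
        self.timeout_s = timeout_s
        self.on_failure = on_failure  # stand-in for the fail-over service
        self.last_notification: Optional[float] = None
        self.failed = False

    def engine_notified(self, now: float) -> None:
        # Called whenever the engine reports that it is still functional.
        self.last_notification = now

    def check(self, now: float) -> bool:
        """Return True if the engine is considered failed or hung."""
        if self.failed or self.last_notification is None:
            return self.failed
        if now - self.last_notification > self.timeout_s:
            self.failed = True
            # Fire once; the fail-over service would relay this via the RMC.
            self.on_failure("active engine missed its health notification")
        return self.failed
```

Because the check runs on the same node as the engine, no per-client heartbeat traffic crosses the network; only the single failure notification does.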

During step 1225 the fail-over service on the standby engine, utilizing the subscriber/client information previously passed via the RMC, issues an active engine failure notification to the messaging infrastructure (e.g., local message exchange—LMX) for each client/subscriber to the failed engine. The active engine failure notification message identifies: the failed engine (by handle), the new active engine (by handle), and a time period within which the new active engine will complete startup.

At step 1230, the standby engine transitions from the Standby—ready state 904 to the Active state 906 (see, FIG. 9). By way of example, the fail-over service updates the status of the standby engine to reflect that the engine is transitioning to active status (state transition from Standby—Ready 904 to Active 906 in FIG. 9). Thereafter, the fail-over service directs the standby engine to commence running in the active state (e.g., invoke startup methods on each of its hosted application objects, etc.). The now active application engine notifies the fail-over service when its startup procedures are complete. In response the fail-over service updates the status of the “transitioning” engine to reflect that the engine is now active (see, Active state 906) and executing its hosted application objects.

Thereafter, at step 1240, the fail-over service for the now active (previously standby) engine, utilizing the subscriber/client information previously passed via the RMC, notifies the messaging infrastructure (e.g., LMX) for subscribers/clients (e.g., INTOUCH plant/process visualization application) that the former standby engine is now the active partner. The active status notification message to the messaging infrastructures that serve subscribing clients includes: an engine identification of the now active engine (by handle) and an “active” status identifier.

Thereafter at step 1250, with complete transparency to the client/subscriber, the messaging infrastructures update their routing tables with regard to all references affected by the change to the new active engine. The message exchange handle for each data/attribute reference previously associated with the failed engine is replaced by a handle corresponding to the reference on the new active engine of the fail-over pair. As a consequence, without changing any reference strings used by the client/subscribers (i.e., with client-transparency), all data subscriptions with the failed active engine are re-routed/connected to the new active engine.
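The routing table update of step 1250 can be pictured as swapping handles behind unchanged reference strings. The Python sketch below is illustrative only: representing the routing table as a dictionary from reference strings to integer engine handles, and the function name, are assumptions that do not describe the actual message exchange data structures.

```python
from __future__ import annotations


def reroute_references(routing_table: dict[str, int],
                       failed_handle: int,
                       new_active_handle: int) -> int:
    """Point every reference at the failed engine to the new active engine.

    Hypothetical helper. Keys (client-visible reference strings) are never
    modified; only the handle stored behind each key is replaced.
    Returns the number of references that were re-routed.
    """
    rerouted = 0
    for reference, handle in routing_table.items():
        if handle == failed_handle:
            # Assigning to an existing key is safe while iterating.
            routing_table[reference] = new_active_handle
            rerouted += 1
    return rerouted
```

Since the keys are untouched, a client holding a reference string before the fail-over can continue to use the same string afterwards.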

In the role-based redundant engine arrangement described herein above, the primary and backup engines, while hosted by distinct physical platforms, are treated as a single logical entity (e.g., client references to objects/attributes hosted on the engine pair partner do not distinguish between the two entities that make up the fail-over engine pair) within a global/unified name space. A same name is assigned to both the primary and backup engines of a fail-over pair, and the engines are distinguished by operations performed based upon their current role/status. Therefore, clients/subscribers of a redundant engine issue their requests to a logical fail-over enabled engine entity encompassing both the primary and backup engines. The messaging and naming services transparently resolve the reference/name strings to an identifier for the currently active application engine of the fail-over enabled engine pair without any knowledge of the clients. This potentially results in a streamlined process for: switching active server/publisher engines in a fail-over pair, and relocating application engine objects to new platforms within a network.

Upon receiving notification that the standby engine is now running active, the messaging component (e.g., Message Exchange) switches, by way of example, a set of three different types of references to attributes from the failed engine to the new active engine.

-   Supervisory references—including references for: modifying attributes (supervisory sets that are not subject to security), monitoring changes to an attribute (supervisory gets with subscription), and retrieving data from an attribute (supervisory gets without subscription).
-   User references—including references for: modifying attributes (user sets associated with logged on users that are subject to security), monitoring changes to an attribute (user gets with subscription), retrieving data from an attribute (user gets without subscription), and pre-binding references.
-   System references—including references for: modifying attributes (system sets such as ones associated with a global network repository/database of system information), monitoring changes to an attribute (system gets with subscription), and retrieving data from an attribute (system gets without subscription).

In an exemplary embodiment, the process of switching references is transparent to message exchange clients. The clients utilize location-independent names from a global namespace (maintained by the global name table 125) to reference attributes associated with the fail-over enabled application engines. As a result, when fail-over to a standby engine on a different network node occurs, none of the reference names used by the clients change (since the reference names are equally applicable to an activated primary or backup application engine).

After the former standby engine commences operating as the active engine, clients receive a data update, for subscriptions, containing the current value of the attribute on the newly activated engine. If the delta/delay time between when the client engine receives notification of the active engine's failure and the time the client engine receives notification that the standby has become active exceeds a configured limit then the quality of data associated with all referenced attributes will be set to “bad” until receiving the data updates from the newly activated engine. The configured limit (with a default of 15 seconds) is configurable at runtime and configuration time for all engines within the scope of the global namespace, and persists across engine restarts.
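The data quality rule described above reduces to comparing the delay between two notifications against the configured limit. The following Python sketch is a hypothetical illustration; the function and parameter names are assumptions, and only the 15 second default and the "exceeds" comparison come from the description.

```python
DEFAULT_FAILOVER_DELAY_LIMIT_S = 15.0  # stated default for the limit


def should_mark_quality_bad(failure_notified_at: float,
                            active_notified_at: float,
                            limit_s: float = DEFAULT_FAILOVER_DELAY_LIMIT_S) -> bool:
    """Return True if referenced attributes should be flagged "bad".

    Hypothetical helper. Both timestamps are in seconds on the same clock:
    the first is when the client engine learned of the active engine's
    failure, the second is when it learned the standby became active.
    """
    delay_s = active_notified_at - failure_notified_at
    # Quality is set to "bad" only when the delay exceeds the limit.
    return delay_s > limit_s
```

A delay exactly equal to the limit does not trigger the "bad" quality in this sketch, since the description uses the word "exceeds".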

Global Namespace/Relocating an Active Engine

The above-described fail-over engine configuration and deployment architecture is integrated with a global/unified name space that supports network location independence through name-based access to the application engines. The engines are identified by location-independent names. In the global namespace, references are resolved from physical location-independent references to network addresses by a name service. Under such circumstances, when an engine relocates, only the name service needs to be informed of the new address for the named engine. The name/reference associated with the relocated engine is location-independent, and therefore does not change when the engine is moved to a new platform within a network. Contact with a relocated application engine is established by its clients through re-binding requests submitted to the naming service.

Turning to FIG. 13, a configuration database interface is summarized that facilitates the above-described fail-over functionality in a host (e.g., an application engine that supports a set of application objects) in a process control and manufacturing information system environment.

An IFailOverConfiguration interface 1300 is a primary interface for creating a fail-over host (e.g., application engine pair). The IFailOverConfiguration interface 1300 includes a set of methods including a CreateBackupEngine method 1310. The CreateBackupEngine method 1310 creates a backup fail-over engine object in the configuration database 124. The CreateBackupEngine method 1310, if successful, returns a pointer/reference to an identification for the newly created backup engine object. A DeleteBackupEngine method 1320 deletes a previously created backup fail-over engine object from the configuration database 124. The DeleteBackupEngine method 1320 is called if, during configuration of an application engine, a user did not check the Enable redundancy checkbox 404. A GetBackupEngine method 1330 returns a reference to a backup engine object. A ValidateHostedEngines method 1340 validates (checks the configuration of) all application engines assigned to an identified platform.

An IPackageManager interface 1350, a general interface that manages the object packages within the configuration database 124, comprises a GetFailOverPartnerId method 1360. The GetFailOverPartnerId method 1360 receives, as input, an identification of a fail-over partner engine object. The GetFailOverPartnerId method 1360 returns a reference to the partner engine object. An ObjectStatus method 1370 returns a set of status bits corresponding to the present status of an application object. Exemplary status information includes whether the object is: a template, hidden, checked out, pending update, deployed, primary engine, backup engine, and fail-over enabled.
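A set of status bits such as the one returned by the ObjectStatus method 1370 is commonly modeled as a bit-flag enumeration. The Python sketch below is illustrative only: the flag names follow the exemplary status information listed above, but the bit positions and the helper function are assumptions and do not reflect the actual interface.

```python
from enum import IntFlag, auto


class ObjectStatus(IntFlag):
    """Hypothetical bit layout; actual bit positions are not disclosed."""
    TEMPLATE = auto()
    HIDDEN = auto()
    CHECKED_OUT = auto()
    PENDING_UPDATE = auto()
    DEPLOYED = auto()
    PRIMARY_ENGINE = auto()
    BACKUP_ENGINE = auto()
    FAILOVER_ENABLED = auto()


def is_deployed_backup(status: ObjectStatus) -> bool:
    """True when both the deployed and backup engine bits are set."""
    wanted = ObjectStatus.DEPLOYED | ObjectStatus.BACKUP_ENGINE
    return (status & wanted) == wanted
```

Packing the statuses into one integer lets a caller test several conditions with a single mask comparison, as the helper does.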

In view of the many possible embodiments to which the principles of this invention may be applied, it should be recognized that the embodiments described herein with respect to the drawing figures are meant to be illustrative only and should not be taken as limiting the scope of the invention. Furthermore, the illustrative steps may be modified, supplemented and/or reordered without deviating from the invention. Therefore, the invention as described herein contemplates all such embodiments as may come within the scope of the following claims and equivalents thereof.

1. A redundant host pair runtime arrangement for a process control network environment comprising: a primary network; a first partner of a fail-over host pair, operating on a first machine communicatively connected to the primary network, the first partner hosting a set of executing application components in accordance with an active role assigned to the first partner; a second partner of the fail-over host pair, operating on a second machine communicatively connected to the primary network, the second partner hosting a non-executing version of the set of executing application components in accordance with a standby runtime role; and a monitoring process, operating separately upon the first machine, for sensing a failure of the first partner, and in response, initiating a notification to the second partner to take over the active role.
 2. The redundant host pair runtime arrangement of claim 1, wherein the second partner receives updates including engine synchronization data associated with the set of executing application components to facilitate taking over the active role currently assigned to the first partner.
 3. The redundant host pair runtime arrangement of claim 2 further comprising: a redundancy message channel, separate and distinct from the primary network, providing a communications path between the first machine and second machine facilitating passing the updates including engine synchronization data.
 4. The redundant host pair runtime arrangement of claim 3 wherein the engine synchronization information comprises checkpoint information.
 5. The redundant host pair runtime arrangement of claim 4 wherein the checkpoint information comprises a set of objects deployed upon the first partner.
 6. The redundant host pair runtime arrangement of claim 4 wherein the checkpoint information comprises alarm limits.
 7. The redundant host pair runtime arrangement of claim 4 wherein the checkpoint information comprises object configuration information.
 8. The redundant host pair runtime arrangement of claim 3 wherein the engine synchronization information comprises alarm states.
 9. The redundant host pair runtime arrangement of claim 3 wherein the engine synchronization information comprises subscriber lists for a data acquisition engine.
 10. The redundant host pair runtime arrangement of claim 3 wherein the engine synchronization information comprises store forward buffered historized process control data acquired by the active host of the redundant host pair in accordance with a set of executing application objects.
 11. The redundant host pair runtime arrangement of claim 1 wherein the second partner further comprises logic for independently determining a fail-over condition without receiving a failure notification from the first machine running the active host of the redundant host pair, and thereafter taking over the active role.
 12. The redundant host pair runtime arrangement of claim 11 wherein the fail-over condition comprises losing communication contact with the first partner.